NAVAL 

POSTGRADUATE 

SCHOOL 

MONTEREY,  CALIFORNIA 

DISSERTATION 


CROSS-DOMAIN  NETWORK  FAULT 
LOCALIZATION 

by 

William  D.  Fischer 
June  2009 

Dissertation  Supervisor:  Geoffrey  G.  Xie 


Approved  for  public  release;  distribution  is  unlimited. 




REPORT  DOCUMENTATION  PAGE 

Form  Approved  OMB  No.  0704-0188 

Public  reporting  burden  for  this  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instruction, 
searching  existing  data  sources,  gathering  and  maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send 
comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information,  including  suggestions  for  reducing  this  burden, 
to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204, 

Arlington,  Va  22202-4302,  and  to  the  Office  of  Management  and  Budget,  Paperwork  Reduction  Project  (0704-0188)  Washington  DC  20503. 

1. AGENCY USE ONLY (Leave blank)

2.  REPORT  DATE 

June  2009 

3.  REPORT  TYPE  AND  DATES  COVERED 

Dissertation 

4.  TITLE  AND  SUBTITLE 

Cross-Domain  Network  Fault  Localization 

5.  FUNDING  NUMBERS 

6.  AUTHORS  William  D.  Fischer 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Naval  Postgraduate  School 

Monterey  CA  93943-5000 

8. PERFORMING ORGANIZATION REPORT NUMBER

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

10.  SPONSORING/MONITORING 
AGENCY  REPORT  NUMBER 

11.  SUPPLEMENTARY  NOTES  The  views  expressed  in  this  thesis  are  those  of  the  author  and  do  not  reflect 
the  official  policy  or  position  of  the  Department  of  Defense  or  the  U.S.  Government. 

12a.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  is  unlimited. 

12b.  DISTRIBUTION  CODE 

13. ABSTRACT (maximum 200 words)

Prior  research  has  focused  on  intra-domain  fault  localization  leaving  the  cross-domain  problem  largely 
unaddressed. Faults often have widespread effects, which if correlated, could significantly improve fault localization.
For both competitive and security reasons, domain managers hesitate to share fault observations even
when  doing  so  may  significantly  ease  fault  localization.  This  dissertation  presents  a  characterization  of  the 
problem  space  in  terms  of  inference  accuracy,  privacy,  and  scalability,  and  provides  a  framework  to  evaluate  any 
design  in  the  design  spectrum.  This  framework  not  only  explicitly  models  the  inference  accuracy  and  privacy 
requirements  for  discussing  and  reasoning  over  cross-domain  problems,  but  also  addresses  scalability  impacts 
and  facilitates  the  re-use  of  existing  fault  localization  algorithms  while  enforcing  domain  privacy  policies.  The 
dissertation  provides  a  graph-digest-based  approach  with  which  participating  network  domains  can  exchange 
abstracted  graphs  that  represent  network  fault  propagation  models.  The  research  explores  feasibility  of  this 
approach  via  implementation  of  an  inference  graph-based  design  in  a  cross-domain  network  setting.  The  results 
show  a  substantial  improvement  in  cross-domain  fault  localization  accuracy  and  inference  speed  by  using  the 
inference-graph-digest  based  approach. 

14.  SUBJECT  TERMS 

Networking,  Fault  Localization,  Cross-Domain,  Bayesian 

15. NUMBER OF PAGES

135

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT

Unclassified

18. SECURITY CLASSIFICATION OF THIS PAGE

Unclassified

19. SECURITY CLASSIFICATION OF ABSTRACT

Unclassified

20. LIMITATION OF ABSTRACT

UU

NSN  7540-01-280-5500  Standard  Form  298  (Rev.  2-89) 


Prescribed  by  ANSI  Std.  239-18  298-102 




Approved  for  public  release;  distribution  is  unlimited. 

CROSS-DOMAIN  NETWORK  FAULT  LOCALIZATION 


William  D.  Fischer 

Lieutenant  Colonel,  United  States  Army 
B.S.,  College  of  William  and  Mary,  1989 
M.S.,  Naval  Postgraduate  School,  2001 
Submitted  in  partial  fulfillment  of  the 
requirements  for  the  degree  of 

DOCTOR  OF  PHILOSOPHY  IN  COMPUTER  SCIENCE 

from  the 

NAVAL  POSTGRADUATE  SCHOOL 
June  2009 


Author: 

William  D.  Fischer 

Approved  by: 

Geoffrey  G.  Xie 

Professor  of  Computer  Science,  Dissertation  Supervisor 


Craig H. Martell
Associate Professor of Computer Science

Mikhail Auguston
Associate Professor of Computer Science

Joel D. Young
Lieutenant Colonel, United States Air Force
Assistant Professor of Computer Science

Craig W. Rasmussen
Associate Professor of Applied Mathematics

Approved  by: 

Peter  J.  Denning,  Chair,  Department  of  Computer  Science 

Approved  by: 

Doug  Moses,  Associate  Provost  for  Academic  Affairs 




ABSTRACT 


Prior research has focused on intra-domain fault localization leaving the cross-domain
problem largely unaddressed. Faults often have widespread effects, which if
correlated, could significantly improve fault localization. For both competitive and
security reasons, domain managers hesitate to share fault observations even when
doing so may significantly ease fault localization. This dissertation presents a
characterization of the problem space in terms of inference accuracy, privacy, and scalability,
and provides a framework to evaluate any design in the design spectrum. This framework
not only explicitly models the inference accuracy and privacy requirements for
discussing and reasoning over cross-domain problems, but also addresses scalability
impacts and facilitates the re-use of existing fault localization algorithms while
enforcing domain privacy policies. The dissertation provides a graph-digest-based approach
with which participating network domains can exchange abstracted graphs that
represent network fault propagation models. The research explores feasibility of this
approach via implementation of an inference graph-based design in a cross-domain
network setting. The results show a substantial improvement in cross-domain fault
localization accuracy and inference speed by using the inference-graph-digest-based
approach.


I. INTRODUCTION


Computer network faults happen frequently, and finding the source of a failure is
a non-trivial task. Although much progress has been made in locating faults within
network domains, finding faults that affect multiple domains, known as cross-domain
fault localization, remains under-researched. Network domain administrators currently
perform fault diagnosis in isolation, without the benefit of evidence observed in other
domains. Today’s highly connected networks need collaboration to locate complex
failures, but privacy concerns tend to prevent cooperation across network boundaries.
This research proposes a cooperative approach in which domain administrators share
some data to find these elusive faults while preserving the privacy of sensitive network
domain properties.

Faults in a network occur often and in complex ways, and it is well documented
that managers must respond to these failures on a regular basis [19,23,46]. A wide
variety of components in a network can fail, and fiber cuts, router misconfiguration,
and power and maintenance outages are becoming more common [13,23,48]. Diversity
of network elements within a network continues to grow [8]. The heterogeneity of
elements in a network complicates all aspects of managing a network. Dependencies
between network elements are not always deterministic, increasing the difficulty of
correlating observations about network state. Failure durations can vary, increasing
the difficulty of correlating observation data and further complicating diagnosis
[19,48]. Serious faults can go undetected and may resist rapid localization [23].

When observations of network state arising from a network fault propagate
across domain boundaries, the fault is described as cross-domain. Troubleshooting
faults is a challenging task; it is even more difficult when trying to troubleshoot
cross-domain issues without knowledge of fault observations and network structure
from neighboring network domains. Acquiring knowledge of the needed observations
and network topology is further complicated by the fact that it is risky, for both
competitive and security reasons, for domain managers to share this information even
when the sharing might ease fault localization. With business processes migrating
to web services, implemented in the “cloud” and built on protocols such as SOAP
(Simple Object Access Protocol), the likelihood of network faults impacting multiple
domains approaches unity.

Figure 1.1. Simple Failure Scenario. Data flow for a workgroup in Domain 1 transits
Domain 2 to reach a server in Domain 3.

Faults often have widespread effects which, if correlated, can significantly
increase fault localization accuracy. This research defines inference gain to be the
increase in inference accuracy achieved by correlating additional evidence. Cross-domain
network failures cannot always be localized without a coordinated effort
between domains. Consider the simple failure scenario depicted in Figure 1.1. A
work group in Domain 1 must access data from a server in Domain 3, requiring
connectivity through Domain 2. Unfortunately, one of the routers in Domain 2 is
misconfigured. Other groups and services can reach the server in Domain 3, but
users in Domain 1’s work group cannot. Furthermore, no equipment failures along
the path from the work group (Domain 1) to the server (Domain 3) trigger alarms.
This is difficult to troubleshoot without cross-domain collaboration, often resulting in
“finger pointing.” While the fault remains unabated and potentially unnoticed, there
may be observations external to each domain that could help detect and localize the
fault. Overcoming obstacles to cross-domain collaboration can realize the inference gain
needed to resolve otherwise ambiguous network errors.

The cross-domain environment introduces a source of complex potential failures.
There are more than 10,000 autonomous systems (ASs) in the Internet today,
each applying local policies for route selection [42]. Domains in the Internet today
are loosely coupled [13], and since cross-domain flows depend on network elements
in more than one domain, no single domain has complete control of all risks to the
flows. Operators make manual changes in routing policies without fully understanding
the impact on other domains, and business arrangements may restrict traffic flow
between autonomous systems [14]. Links between domains are common points of
congestion, and traffic engineering between domains is often achieved through trial
and error [14]. Traffic engineering across domains is significantly more complicated
than traffic engineering within a domain [14]. This reality implies that the complexity
of performing fault localization across domains also increases.

Privacy, scalability, and interoperability issues hinder efforts to achieve accurate
cross-domain fault localization. While prior work has stated the importance of
these issues [18,25,30,44], a review of the literature did not find a formal definition
of requirements addressing them. These same issues prevent network collaboration
for other types of inference [27,47]. Network domain managers are often unwilling or
not permitted to share detailed internal network architectures and quality-of-service
issues with outside agencies, which conflicts directly with the need to share data to
troubleshoot networking issues successfully. Automated techniques for finding faults
across a large number of domains face serious computational issues, and exact
computation using belief networks is NP-hard [19]. Interoperability in network-management
and fault-isolation techniques is a perennial problem: different modeling techniques and
tools using different algorithms will be employed in various domains. Conflicts of
information formats and semantics may arise between domains, with each domain’s
model assigning a different value to the same parameter. There is a pressing need
for methods that enable efficient cross-domain fault localization while minimizing the
need to share sensitive proprietary information.

Cross-domain  fault  isolation  efforts  are  hampered  by  privacy  issues.  Domain 
managers  may  be  reluctant  to  share  the  details  of  their  network  dependency  graph 
data  with  other  domain  managers.  ISPs  do  not  want  statistics  on  the  number  of 
problems  observed  in  their  network  publicly  available,  and  do  not  want  to  share 
their topology and state information with their competitors [25,27,44]. Different
network  providers  have  proprietary  network  fault  management  systems  without  open 
interfaces  [18].  Each  domain  manager  will  have  some  information  about  another 
network  domain,  such  as  shared  peering  points,  public  web  services,  and  publicly 
available  company  information.  The  vast  majority  of  components  in  another  network, 
however,  can  only  be  modeled  as  a  cloud.  Network  providers  are  likely  considered 
competitors  and  any  benefit  attained  by  collaboration  to  localize  faults  must  outweigh 
the  cost  of  revealing  internal  details  to  the  competition. 

A  cross-domain  fault  localization  approach  may  not  scale.  There  is  no  central 
fault management system for the Internet, and combining data to resolve a cross-domain
failure scenario with a centralized model may not be realistic. Consider a
scenario  in  which  a  backbone  link  failure  has  impacted  many  domains.  In  the  worst 
case  much  of  the  Internet  may  need  to  be  mapped  into  a  fault  propagation  model. 
Traditional  debugging  tools  do  not  scale  across  administrative  domains  [30].  A  typical 
tier-1  network  (also  known  as  an  Internet  backbone  network  [24])  has  approximately 
1,000  routers,  supported  by  two  orders  of  magnitude  more  access  and  core  transport 
network  elements  [23]. 

To  interoperate,  a  cross-domain  approach  must  overcome  the  heterogeneity  of 
existing  fault  localization  approaches.  Even  if  domain  managers  collaborate  for  the 
purpose  of  fault  localization,  they  may  use  different  methods  of  data  representation 
and interpretation. Different domains may be using different inference algorithms,
different tools, and possibly different data schema. Each domain may employ a fault
management  system  that  is  fundamentally  different  from  those  used  by  other  domains 
with  whom  they  regularly  interact.  Inference  methods  used  by  fault  management 
systems  vary  widely,  and  the  data  from  one  may  not  have  meaning  in  another. 

Three general approaches are possible for diagnosing cross-domain problems.
The first of these, the status quo, is isolated inference. In this approach, each domain
tries to locate the fault without sharing data with other domains. The second
approach, referred to as full disclosure, entails full collaboration and data-sharing
between domains. A fault propagation model using this approach is equivalent to a
global model of all domains involved. While full disclosure is, in general, unrealistic
because of privacy and scalability concerns, it is included as a baseline
model for studying inference gains achievable from information sharing. The third
approach, proposed by this research and termed “cooperative”, is to implement a design
in the design spectrum that lies somewhere between isolated inference and full
disclosure. In this third approach, domains exchange limited information, e.g., summaries of
fault observations, to perform inference while protecting sensitive information. This
research focuses on exploring the feasibility of the third approach.

A.  PROBLEM  STATEMENT  AND  MAIN  HYPOTHESIS 

Problem  Statement:  Cross-domain  fault  localization  is  an  under-researched 
area  for  which  no  general  approach  currently  exists.  External  evidence  that  can 
improve  inference  accuracy  about  network  faults  is  unavailable  to  domain  inference 
algorithms. Privacy, scalability, and interoperability issues restrict information
exchange about these observations.

Main  Hypothesis:  It  is  possible  to  construct  a  framework  to  enable  managers 
of  separate  network  domains  to  share  information  and  achieve  inference  gain  while 
quantifying privacy preservation of sensitive information.

B. CONTRIBUTIONS


This research makes the following major contributions to the state of the art in
computer network fault localization:


•  This  research  provides  a  characterization  of  the  problem  space  for  cross-domain 
fault  localization,  and  provides  explicit  metrics  to  evaluate  any  approach  in 
terms  of  the  core  issues  of  accuracy,  privacy,  and  scalability. 

•  This research develops the first concrete solution framework providing a feasible,
general approach to address cross-domain fault localization. This framework
describes a graph digest approach that enables domains that use causal
graphs to model fault propagation to exchange summary inference
information.

•  This  research  provides  a  first  application  of  the  framework  using  intra-domain 
fault  localization  algorithms  to  locate  faults  in  a  cross-domain  setting. 

•  This  research  provides  a  first  heuristic  to  learn  a  network  domain’s  topology 
from  a  bipartite  causal  graph. 


C.  ORGANIZATION 

The  outline  for  the  remainder  of  this  dissertation  is  as  follows: 

•  Chapter II discusses related work in fault localization, fault localization
algorithms, and cross-domain fault localization.

•  Chapter III describes the approach to model the problem, including metrics
to evaluate an approach.

•  Chapter  IV  provides  the  evaluation  methodology. 

•  Chapter  V  presents  the  evaluation  results. 

•  Chapter  VI  presents  the  conclusions  for  this  research,  and  suggests  areas  of 
future work.


II. RELATED WORK


This chapter presents the state of the art for intra-domain and cross-domain
fault localization. As this research examines whether existing fault localization
algorithms can be applied in a cross-domain context, a selection of recent algorithms
is presented. This chapter is organized as follows. First, the chapter discusses general
concepts of network fault localization. Second, the principal methods of inference
used by recent fault localization approaches are surveyed. In particular, SHRINK and
SCORE are highlighted because they are used to evaluate the model in Chapter V.
Third, solutions proposed prior to this effort for cross-domain fault localization are
discussed. Fourth, network tomography, a current approach used to collect network
status data, is discussed. Finally, recent work on graph anonymization, which has
direct bearing on privacy preservation, is described.

A.  NETWORK  FAULT  LOCALIZATION 

Failures can stem from many causes, including hardware, software, and configuration
errors. Errors may be introduced at each stage of a network’s architectural
implementation [23]. Fault diagnostic information is subject to errors due to
inaccurate models of network dependencies, missing observations, and spurious
observations [39]. Typically, human operators perform device configuration, resulting in
potentially misconfigured devices.

The sheer number of components in a system increases both the frequency of failures
and the complexity of locating a failure. Network flows can cross many components
in a network, and a failure of any one component on the path can sever
end-to-end connectivity. A typical tier-1 network has roughly 1,000 routers from
different vendors, having different features and playing different roles [23].

Fault localization is the second step of fault diagnosis, which consists of three
steps: fault detection, fault localization, and testing/analysis [8,37]. Fault detection
usually comes from alarms. Network administrators typically employ a fault
localization solution to arrive at likely hypotheses to explain the observations about
detected faults. Fault localization, a major task in maintaining network service, is the
process of determining the actual faults responsible for observed problems in a
system [41]. The data typically available to perform fault localization includes potential
fault causes, observations of network state, dependencies between causes and
observations, prior probabilities of fault causes, and dependencies between fault causes. Fault
localization algorithms return a best explanation, which is a set containing the most
likely failed root cause or causes, given the model and available evidence. Ideally, a
best explanation precisely matches the ground truth.

Domain  managers  typically  employ  a  fault  management  system,  instrumented 
with  alarms,  that  infers  the  best  explanation  for  observed  alarms  based  on  the  network 
dependencies.  These  dependencies  are  stored  in  a  shared  risk  database,  representing 
the network components subject to failure [19]. Many of the alarms are based on
Simple Network Management Protocol (SNMP) trap messages, and on Traceroute and ping
results. SNMP uses UDP to send trap messages based on state variable thresholds.
Since  UDP  is  an  unreliable  protocol,  not  all  trap  messages  will  reach  the  network 
management  system,  resulting  in  lost  observations. 

The principle of Occam’s Razor is fundamental to network fault localization [23].
Given the low prior probability of any single component failing, an explanation
that involves fewer conditionally independent component failures is more likely to
account for the observed failures. With a typical network component failure
probability of 10^-5 in any given hour [19], the probability that multiple independent
components have failed simultaneously decreases sharply as the number of
hypothesized simultaneously failed nodes grows.
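
As a rough numerical illustration of this argument, the following sketch computes the probability that k specific components have all failed at once. It assumes fully independent failures and takes 10^-5 as an illustrative per-hour failure probability; the function name and the chosen values of k are assumptions made here for illustration only.

```python
import math


def joint_failure_probability(p: float, k: int) -> float:
    """Probability that k specific, independent components are all failed,
    given that each one fails with probability p."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return p ** k


if __name__ == "__main__":
    p = 1e-5  # assumed per-hour failure probability of a single component
    for k in (1, 2, 3):
        print(f"{k} simultaneous failure(s): {joint_failure_probability(p, k):.1e}")
```

Each additional hypothesized failure multiplies the probability by another factor of p, which is why explanations with fewer failed components are strongly favored.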

Fault localization is further complicated by the existence of errors in the data.
These errors include inappropriate prior probabilities, incorrect or inappropriate
dependency mappings, and erroneous observations of network state. Prior probabilities
used may not be representative of the actual failure rates. Dependency mappings
are constructed either from human input or by analyzing traffic flow
on a network. Observations of network state are determined through client reporting,
SNMP traps, or probing tools such as Traceroute. Intuitively, any
method that relies on human operator input is subject to error. Changes in network
configuration are not always recorded, resulting in incorrect dependency mappings.
Observation nodes are subject to false negative observations, such as lost SNMP
packets, and false positive observations, such as spurious symptoms [37] in the network.
Errors must be modeled appropriately in any network fault localization approach to
identify root causes that best explain observed evidence. Ineffective error modeling
can lead to incorrect identification of root causes of network failures, which in turn
leads to increased downtime for the network and resources expended to implement
failure recovery.

B.  FAULT  LOCALIZATION  ALGORITHMS 

Before describing specific fault localization algorithms, this section first
summarizes a few core underlying concepts used by recent approaches, including
assumptions and heuristics that make otherwise intractable algorithms practical for
identifying faults.

The minimum set-cover problem is known to be NP-Complete [10]. An instance
of this problem consists of a finite set X and a family Y of subsets of X such that
each element of X belongs to at least one subset in Y. A solution to this problem is
a minimum-size collection of subsets from Y whose union contains all
elements of X. Modeling possible causes of failure as the cover set Y and observations
of failure as the set X, as done by one of the studied fault localization approaches [23],
identifies the least number of failures that explain the state of observations about
failure. Although ideal in its application of Occam’s razor, by itself a minimum set-cover
approach lacks a mechanism to leverage prior probabilities and non-binary causal
dependencies.
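
The set-cover formulation can be sketched in a few lines. Because an exact minimum cover is intractable in general, the sketch below uses the standard greedy approximation rather than an exact solver; it is not presented as the procedure of any particular system. The cause and observation names are hypothetical.

```python
from typing import Dict, List, Set


def greedy_set_cover(observations: Set[str],
                     causes: Dict[str, Set[str]]) -> List[str]:
    """Greedy approximation to minimum set cover.

    observations: the set X of failure observations to explain.
    causes: maps each candidate cause (a subset in Y) to the observations
            it would explain if it had failed.
    Returns the chosen causes in the order they were picked.
    """
    uncovered = set(observations)
    chosen: List[str] = []
    while uncovered:
        # Pick the cause explaining the most still-unexplained observations;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(causes), key=lambda c: len(causes[c] & uncovered))
        gain = causes[best] & uncovered
        if not gain:
            raise ValueError("some observations cannot be explained by any cause")
        chosen.append(best)
        uncovered -= gain
    return chosen


if __name__ == "__main__":
    # Hypothetical causes and the link failures each would produce.
    causes = {
        "fiber_1": {"link_1"},
        "fiber_2": {"link_2"},
        "fiber_3": {"link_3"},
        "switch_1": {"link_1", "link_2"},
    }
    print(greedy_set_cover({"link_1", "link_2", "link_3"}, causes))
```

The greedy choice does not guarantee a minimum cover, and, as noted above, a pure cover ignores prior probabilities and edge strengths.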

Bayes’ rule allows diagnostic inference to reason about causes, given the
evidence:

\[ \Pr(B \mid A) = \frac{\Pr(A \mid B)\,\Pr(B)}{\Pr(A)} \]

In computer networking, the probability of observing a failure given a component
has failed may be estimated or directly measured. That knowledge can be used to
leverage the power of Bayes’ rule to derive the probability that a component has
failed, given the state of an observation.
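
A minimal worked sketch of this use of Bayes’ rule follows, for a single component and a single binary observation. The function name, the false-alarm parameter, and the numeric values are illustrative assumptions, not measurements from any network.

```python
def posterior_failure(prior: float,
                      p_obs_given_failed: float,
                      p_obs_given_ok: float) -> float:
    """Pr(component failed | failure observed) by Bayes' rule.

    prior:              Pr(component failed)
    p_obs_given_failed: Pr(failure observed | component failed)
    p_obs_given_ok:     Pr(failure observed | component working),
                        i.e. the rate of spurious observations
    """
    # Total probability of the evidence, Pr(failure observed).
    evidence = p_obs_given_failed * prior + p_obs_given_ok * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("the observation has zero probability under the model")
    return p_obs_given_failed * prior / evidence


if __name__ == "__main__":
    # Assumed values: a rare failure, a reliable alarm, and a very low
    # spurious-alarm rate.
    print(posterior_failure(prior=1e-4, p_obs_given_failed=1.0, p_obs_given_ok=1e-6))
```

Even with a small prior, a reliable observation with a very low spurious rate yields a high posterior; raising the spurious rate lowers it quickly.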

Many  inference  methods,  such  as  Bayesian  inference,  bound  the  number  of 
simultaneous  failures  [19].  Reasonable  independence  assumptions  and  heuristics  can 
reduce  the  complexity  of  this  NP-Hard  problem  [31]  with  little  sacrifice  in  accuracy. 
Assumptions  about  the  maximum  number  of  failed  components  based  on  the  high 
reliability of networking components also serve to reduce complexity [4,19]. In general,
in  the  networking  domain  it  is  reasonable  to  assume  independence  between  failure 
causes  (e.g.  a  router  fault  on  one  side  of  a  network  domain  says  nothing  about 
whether  a  cable  is  cut  on  the  other  side)  [4,19].  Additionally,  greedy  approaches 
can  achieve  good  results  in  practice,  as  does  one  of  the  studied  fault  localization 
algorithms,  SCORE  [23]. 

A 2004 survey by Steinder and Sethi [37] classifies techniques used
in fault localization, dividing them into three broad categories: expert systems, model
traversing, and graph theoretic techniques. Expert systems attempt to mimic a human
expert to solve problems within a domain. Model traversing techniques represent
network entities and their relationships, and then traverse the model graph to
correlate alarms and locate faults. Graph theoretic techniques use a fault propagation
model to describe entities and conditional dependencies between them in a
dependency graph.

The survey further divides the graph theoretic techniques into five categories.
Divide and conquer algorithms cluster dependencies in a dependency graph, then
recursively subdivide the clusters containing nodes explaining failure observations.
This process continues until a singleton having the highest probability of explaining
the failed observations is derived. Context-free grammar approaches represent
network components as terminals, and use productions to capture network
dependencies. Codebook techniques are represented with a matrix of problem codes that
can be used to “look up” the cause given the observed effects. Bayesian network
approaches [4,19] use directed acyclic graphs (DAGs) in which nodes represent random
variables modeling the state of network elements, and edges represent conditional
probabilities [37]. Finally, bipartite causal graph models [23,36,40] use bipartite fault
propagation models to represent cause and effect relationships.
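
To make the codebook “look up” concrete, the sketch below stores one binary symptom code per cause and returns the cause or causes whose code is closest to the observed symptom vector. Nearest-match decoding by Hamming distance is an assumption chosen here for illustration; the cause names and codes are hypothetical.

```python
from typing import Dict, List, Tuple

Code = Tuple[int, ...]


def hamming(a: Code, b: Code) -> int:
    """Number of positions in which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def codebook_lookup(codebook: Dict[str, Code], observed: Code) -> List[str]:
    """Return the causes whose code is nearest to the observed symptoms.

    Several causes are returned (sorted by name) when they tie.
    """
    if not codebook:
        raise ValueError("codebook is empty")
    best = min(hamming(code, observed) for code in codebook.values())
    return sorted(c for c, code in codebook.items()
                  if hamming(code, observed) == best)


if __name__ == "__main__":
    # Columns are three hypothetical symptoms; 1 means the symptom is expected.
    codebook = {
        "router_A": (1, 1, 0),
        "link_B": (0, 1, 1),
        "server_C": (0, 0, 1),
    }
    print(codebook_lookup(codebook, (1, 1, 0)))
    print(codebook_lookup(codebook, (0, 1, 0)))
```

Matching on the nearest code rather than an exact code tolerates a limited number of lost or spurious observations, at the cost of ambiguous answers when codes are close together.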

A  current  trend  is  to  model  the  problem  as  a  DAG  having  root  causes  as 
parentless  (root)  nodes,  observations  as  childless  (leaf)  nodes,  and  dependencies  as 
directed  edges  in  the  graph.  These  edges  express  conditional  probabilities  between 
elements  in  a  network,  and  allow  determining  the  conditional  probability  table  for 
each  node.  This  graph  structure  is  also  known  as  a  causal  graph  [37].  The  solution 
approaches using a causal graph typically perform probabilistic inference on the
constructed dependency graphs. Most, if not all, network fault propagation models can
be  transformed  and  represented  with  a  causal  graph. 

Root causes of network failure are also known as shared risk groups (SRGs) [25].
Shared risks are typically hardware components that can fail and are represented
by  the  set  of  nodes  that  are  dependent  on  the  shared  risk  [23].  In  a  bipartite  causal 
graph  all  SRG  nodes  are  root  nodes,  and  members  of  an  SRG  set  are  observation  nodes 
represented  graphically  by  directed  dependency  edges  from  the  SRG  nodes  to  their 
member  observation  nodes.  In  this  dissertation,  observation  nodes  and  observations 
are  used  interchangeably. 

Recent approaches exhibiting these techniques include SCORE (non-probabilistic), SHRINK, and Sherlock. SHRINK and SCORE use bipartite graph representations of fault propagation models. Sherlock is the first system to expand the
approach  to  use  multilevel  dependency  graphs.  In  each  case  the  network  dependency 
graph  fed  to  the  algorithm  is  a  causal  graph.  Although  SHRINK,  SCORE,  and  Sher¬ 
lock  have  many  differences,  they  can  all  use  a  bipartite  causal  graph,  and  all  return 
a  best  explanation. 


Figure 2.1. Example network. (a) Physical topology; (b) IP view.

To illustrate the SCORE, SHRINK, and Sherlock algorithms, consider the simple network depicted in Figure 2.1. Figure 2.1(a) depicts the network physical topology, in which IP routers A, B, and C are connected across fibers F1 through F4 and optical cross-connects X1 and X2. Each IP router has an IP link to each other router as shown in Figure 2.1(b). If any of the optical components (fibers or optical cross-connects) fail, the IP routers will detect link failures. The prior SRG failure probabilities are $10^{-4}$ and $10^{-6}$ for the fibers and the cross-connects respectively.

The causal graph in Figure 2.2 provides a visual representation of fault propagation for the network in Figure 2.1. Each hardware component that can fail (fibers F1 ... F4 and switches X1 and X2) is modeled as an SRG, and the IP links (L1 ... L3) are modeled as observation nodes. The edge strength of each edge in the graph depicts the probability that an IP link will observe failure given that the SRG has failed. To illustrate, the edge from F1 to L1 reflects that L1 will observe failure with a probability of 1.0 given that F1 has failed (unlabeled edges have an edge weight of 1.0).

1.  SCORE 

Kompella  et  al.  introduced  SCORE  [23]  in  2005.  SCORE  applies  a  greedy 
minimum  set  cover  technique  to  perform  inference  on  a  bipartite  DAG.  Two  of  the 
strengths  of  the  SCORE  algorithm  are  its  inference  speed  and  its  adherence  to  the 
Occam’s  razor  principle.  SCORE  does  not  use  probability  distributions,  however,  and 
therefore  may  not  be  the  best  algorithm  to  use  when  probability  distribution  data 
is available. SCORE addresses errors through a hit-ratio threshold. This threshold sets the minimum hit ratio, and hence the maximum allowable false positive ratio, for a root cause to be considered as a candidate explanation.

Each potential failure root cause node is represented as a parentless node, and each observation node is represented as a childless node in the graph. The dependencies from parent nodes to child nodes are set to one. Each root cause node has a derived hit ratio that reflects the percentage of this node’s children that have observed failure, and a coverage ratio that reflects the percentage of remaining unexplained failures that can be accounted for by the failure of this root cause node. The hit and cover ratios equal 1 - (false positive ratio) and 1 - (false negative ratio) respectively. Let $O$ represent the set of observation nodes that have observed failure. Let $S_i$ represent the $i$th root cause node, and let $O_i$ be the set of observation nodes that will report failure given that $S_i$ has failed. The hit ratio for $S_i$ is equal to $|O_i \cap O| / |O_i|$ and, once computed for each root cause, does not change for the duration of the algorithm execution. The coverage ratio for $S_i$ is computed using $|O_i \cap O| / |O|$ and is updated with each iteration of the algorithm. With each iteration of the SCORE algorithm the root cause node with the maximum coverage ratio, and having a hit ratio equal to or greater than the input threshold value, is added to a hypothesis vector and the observations associated with the root cause are explained and removed from the set $O$. The algorithm continues, adding root cause nodes to the hypothesis vector until the observation set $O$ is empty.

SRG   Links        Hit Ratio   Cover Ratio
F1    L1, L3       0.5         0.5
X1    L1, L3       0.5         0.5
F2    L1, L3       0.5         0.5
X2    L1, L2, L3   0.67        1.0
F3    L1, L2       1.0         1.0
F4    L2, L3       0.5         0.5

Table 2.1. Hit and cover ratios calculated for observations L1 = down, L2 = down, L3 = up on the example network in Figure 2.1.

Table  2.1  shows  an  example  of  using  SCORE  for  the  network  depicted  in 
Figure 2.1. Consider the scenario where IP links L1 and L2 are observed to be
down,  and  L3  is  observed  to  be  up.  Intuitively,  the  cause  is  most  likely  the  failure 
of  fiber  link  F3.  On  the  SCORE  algorithm’s  first  pass  with  a  threshold  setting  of 
1.0,  only  F3  has  a  hit  ratio  of  1.0,  so  the  algorithm  adds  F3  to  the  hypothesis  set. 
The  hypothesis  completely  explains  the  observations,  so  the  algorithm  returns  F3  as 
the  root  cause  for  the  failure  scenario.  In  a  more  complicated  scenario,  possibly  with 
multiple  failures  and  observation  errors,  SCORE  uses  a  cost  function  that  considers 
the  number  of  SRGs  in  the  hypothesis  and  the  threshold  setting  used  for  the  execution 
of  the  algorithm  to  determine  the  best  explanation. 
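
The greedy selection described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SCORE implementation from [23]: the function name `score_greedy`, the dictionary `SRG_LINKS`, and the early exit when no admissible candidate remains are assumptions made for the example, and the cost function used to compare hypotheses across thresholds is omitted. The SRG-to-link map is taken from Table 2.1.

```python
# Sketch of a SCORE-style greedy set cover on a bipartite causal graph.
# SRG_LINKS mirrors Table 2.1; all names here are hypothetical.

SRG_LINKS = {
    "F1": {"L1", "L3"},
    "X1": {"L1", "L3"},
    "F2": {"L1", "L3"},
    "X2": {"L1", "L2", "L3"},
    "F3": {"L1", "L2"},
    "F4": {"L2", "L3"},
}


def score_greedy(srg_links, failed, threshold=1.0):
    """Return (hypothesis, unexplained) for a set of failed observations."""
    # Hit ratios are computed once, against the full set of failed observations.
    hit = {s: len(links & failed) / len(links) for s, links in srg_links.items()}
    unexplained = set(failed)
    hypothesis = []
    while unexplained:
        # A candidate must meet the hit-ratio threshold and explain something new.
        candidates = [
            s for s in srg_links
            if hit[s] >= threshold and srg_links[s] & unexplained
        ]
        if not candidates:
            break  # assumption: stop when nothing admissible explains the rest
        # Coverage ratio is recomputed against the remaining unexplained failures.
        best = max(
            candidates,
            key=lambda s: len(srg_links[s] & unexplained) / len(unexplained),
        )
        hypothesis.append(best)
        unexplained -= srg_links[best]
    return hypothesis, unexplained


hypothesis, leftover = score_greedy(SRG_LINKS, {"L1", "L2"}, threshold=1.0)
print(hypothesis, leftover)
```

With the observations of Table 2.1 and a threshold of 1.0, only F3 is admissible, so the sketch returns F3 with no unexplained observations, matching the walk-through above.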

2.  SHRINK 

Srikanth Kandula et al. developed SHRINK [19] in 2005 to perform approximate Bayesian inference on a bipartite causal graph. One of the greatest strengths of the SHRINK algorithm is that, if the belief values are normalized, it can return for each hypothesis considered the probability that the hypothesis caused the observed failures. Another of SHRINK’s main contributions is its robustness. SHRINK mitigates potential errors in mapping conditional probabilities and in observation state reporting by reducing each conditional dependency by a noise value, then adding edges with the same noise value to form a complete directed bipartite graph.

The  SHRINK  model  assumes  independent  failures  of  root  cause  nodes  and 
that  no  more  than  three  SRGs  will  fail  simultaneously  in  a  large  network  based  on 
the  extremely  low  likelihood  of  four  or  more  simultaneous  failures.  Noisy-OR  is  used 
to  calculate  the  conditional  probability  table  for  a  node  with  multiple  parents.  The 
SHRINK algorithm is defined as follows. Let $\langle S_1, \ldots, S_n \rangle$ denote a hypothesis vector, where $S_i = 1$ if a failure of SRG $S_i$ is assumed, and $S_i = 0$ otherwise. Let $\langle L_1, \ldots, L_m \rangle$ denote an observation vector, where $L_j = 1$ if a failure of $L_j$ is observed, and $L_j = 0$ otherwise. Given a particular observation vector, the SHRINK algorithm searches through all hypothesis vectors with no more than three assumed failures, and returns those maximizing the posterior probability

$$\arg\max_{\langle S_1, \ldots, S_n \rangle} \Pr(\langle S_1, \ldots, S_n \rangle \mid \langle L_1, \ldots, L_m \rangle).$$

Consider the example network in Figure 2.1 again. Recall that the causal graph has six optical components mapped to SRGs F1 ... F4, X1, and X2. To account for potential database and observation errors a noise value ($10^{-4}$) is subtracted from the conditional probability of each edge in Figure 2.2, and noisy edges with this same value are added to form a complete bipartite graph. For example, $\Pr(L_1 \mid F_1) = 0.9999$ while $\Pr(L_2 \mid F_1) = 10^{-4}$.

Suppose L1 and L2 are down, and L3 is up. As described above, SHRINK only considers hypothesis vectors with at most three total assumed failures. For this six SRG example SHRINK searches through $\sum_{k=0}^{3} \binom{6}{k} = 42$ hypotheses, with hypothesis vector $\langle 0, 0, 0, 0, 1, 0 \rangle$ maximizing the posterior probability for the given observations. SHRINK correctly identifies SRG F3 (i.e., the failure of fiber link F3)
to  be  the  root  cause. 
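
The search above can be illustrated with a brute-force Python sketch. It is not the SHRINK implementation from [19]: it enumerates every hypothesis with at most three failures exactly, and the names `PRIOR`, `SRG_LINKS`, `edge`, `joint`, and `shrink` are hypothetical. The priors, the noise value, and the link memberships are the ones stated in this section; the assumption that an unfailed SRG contributes nothing to the noisy-OR is made for the example.

```python
from itertools import combinations
from math import prod

# Priors from the text: 1e-4 for fibers, 1e-6 for cross-connects.
PRIOR = {"F1": 1e-4, "X1": 1e-6, "F2": 1e-4, "X2": 1e-6, "F3": 1e-4, "F4": 1e-4}
SRG_LINKS = {
    "F1": {"L1", "L3"}, "X1": {"L1", "L3"}, "F2": {"L1", "L3"},
    "X2": {"L1", "L2", "L3"}, "F3": {"L1", "L2"}, "F4": {"L2", "L3"},
}
LINKS = ("L1", "L2", "L3")
NOISE = 1e-4


def edge(srg, link):
    """Edge strength after the noise adjustment (complete bipartite graph)."""
    return 1.0 - NOISE if link in SRG_LINKS[srg] else NOISE


def joint(failed_srgs, observed_down):
    """Unnormalized posterior: prior of the hypothesis times the likelihood."""
    prior = prod(PRIOR[s] if s in failed_srgs else 1.0 - PRIOR[s] for s in PRIOR)
    likelihood = 1.0
    for link in LINKS:
        # Noisy-OR: the link stays up only if every failed SRG fails to affect it.
        p_up = prod(1.0 - edge(s, link) for s in failed_srgs)
        likelihood *= (1.0 - p_up) if link in observed_down else p_up
    return prior * likelihood


def shrink(observed_down, max_failures=3):
    hypotheses = [
        frozenset(c)
        for k in range(max_failures + 1)
        for c in combinations(PRIOR, k)
    ]
    scores = {h: joint(h, observed_down) for h in hypotheses}
    best = max(scores, key=scores.get)
    # Normalizing over the searched hypotheses gives a posterior-style belief.
    return best, scores[best] / sum(scores.values()), len(hypotheses)


best, belief, searched = shrink({"L1", "L2"})
print(sorted(best), searched)
```

Under these assumptions the sketch searches 42 hypotheses and selects F3 for the observations L1 and L2 down and L3 up, in agreement with the example.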

3.  Sherlock 

Paramvir  Bahl  et  al.  introduced  Sherlock  [4]  in  2007.  The  Sherlock  fault 
localization  system  uses  an  inference  algorithm  called  Ferret.  One  of  the  strengths 
of  Sherlock’s  Ferret  algorithm  is  that  it  can  run  on  a  multi-level  graph.  However,  by 
not  considering  prior  probabilities  of  SRGs,  Sherlock  does  not  adhere  to  the  principle 
of  Occam’s  razor.  One  of  the  main  contributions  of  Sherlock  is  in  directly  measuring 
conditional  dependencies  in  a  network  to  populate  its  fault  propagation  model.  This 
data  collection  mitigates  the  lack  of  prior  probabilities  in  its  inference  model  and 
reduces the risk of human-introduced errors in its SRG databases. As in SHRINK, a noise value is subtracted from all root cause dependencies that affect a network path.

Ferret  applies  Breadth-First-Search  (BFS)  to  a  causal  graph,  propagating  val¬ 
ues  down  to  the  observation  nodes.  Ferret  can  run  on  a  multi-level  DAG,  and  con¬ 
ditional  dependencies  between  root  causes  are  represented  by  meta-nodes  inserted 
into  the  graph.  In  addition  to  up  and  down  states,  Sherlock  can  compute  the  belief 
that  a  root  cause  node  is  in  a  troubled  state:  up  but  experiencing  a  performance 
degradation. As in SHRINK, up to three simultaneous failures are hypothesized using Ferret. Each hypothesis vector is set with a permutation of root causes in either the up or down state. The probabilities at each node are propagated down the graph to the leaf nodes, using noisy-OR computations. The equations used to determine the probabilities that a child node is in different states are

$$P(\text{child up}) = \prod_j \left( (1 - d_j) \, (p_j^{\text{troubled}} + p_j^{\text{down}}) + p_j^{\text{up}} \right),$$

$$P(\text{child down}) = 1 - \prod_j \left( 1 - p_j^{\text{down}} + (1 - d_j) \, p_j^{\text{down}} \right), \text{ and}$$

$$P(\text{child troubled}) = 1 - \left( P(\text{child up}) + P(\text{child down}) \right),$$

where $d_j$ represents the causal dependency $P(\text{child} \mid \text{parent}_j)$, and $p_j^{\text{up}}$, $p_j^{\text{down}}$, and $p_j^{\text{troubled}}$ denote the probability of the $j$th parent node being up, down, and troubled respectively.

Once all probabilities have propagated to the observation nodes for a hypothesis, the hypothesis score is calculated by taking the product of the probabilities of the observation nodes, where the value used for a node is P(child up) if the node reports up, P(child down) if the node reports down, and P(child troubled) if the node reports troubled [4]. The hypothesis with the highest score is the best explanation of root
causes  given  the  observations  of  state. 
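
The propagation and scoring rules above can be written out as a small Python sketch. It covers a single parent-to-child level only and is not the Ferret implementation from [4]; the names `child_state` and `hypothesis_score`, the tuple layout, and the toy dependency value of 0.9 are assumptions chosen for illustration.

```python
from math import prod


def child_state(parents):
    """State probabilities of a child node.

    parents: list of (d, p_up, p_down, p_troubled) tuples, one per parent,
    where d is the causal dependency P(child | parent).
    """
    p_up = prod((1 - d) * (pt + pd) + pu for d, pu, pd, pt in parents)
    p_down = 1 - prod(1 - pd + (1 - d) * pd for d, pu, pd, pt in parents)
    p_troubled = 1 - (p_up + p_down)
    return {"up": p_up, "down": p_down, "troubled": p_troubled}


def hypothesis_score(children, reports):
    """Product over observation nodes of the probability of the reported state."""
    return prod(child_state(children[c])[reports[c]] for c in children)


# Toy hypothesis: the parent of L1 is assumed down, the parent of L2 is assumed up.
children = {
    "L1": [(0.9, 0.0, 1.0, 0.0)],
    "L2": [(0.9, 1.0, 0.0, 0.0)],
}
reports = {"L1": "down", "L2": "up"}
score = hypothesis_score(children, reports)
print(score)
```

In this toy case L1 is down with probability 0.9 and L2 is up with probability 1.0, so the hypothesis scores about 0.9. A full search would repeat the scoring for every hypothesis and keep the maximum.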

Returning to the illustration in Figures 2.1 and 2.2 for the observations L1 and L2 down and L3 up, Sherlock considers $\sum_{k=0}^{3} \binom{6}{k} = 42$ hypotheses. Sherlock assigns edge strengths between a router and path at $1 - 10^{-5}$. Assigning this value to all edges yields a causal graph similar to that used by SHRINK, less prior probabilities and noisy edges. The algorithm returns hypothesis F3 as the best explanation with a score of 0.9998.

C.  CROSS-DOMAIN  FAULT  LOCALIZATION 

Cross-domain  fault  localization  is  correlating  observations  from  multiple  do¬ 
mains  to  determine  the  best  explanation  for  detected  faults.  Cross-domain,  multi- 
domain,  and  inter-domain  are  synonymous  terms  found  in  the  literature.  When  data 
is  required  from  multiple  domains  to  consistently  identify  the  cause  of  network  faults, 
a  cross-domain  solution  is  needed.  A  study  of  routing  instability  found  that  all  parties 
pointed to another party as the cause in about 10% of the problems [44]. Despite its importance, little work has been done to address fault localization across administrative domains [44].

High-level approaches to model cross-domain fault localization and solutions with limited scope to address the problem have been proposed. Proposed strategies to model collaboration for cross-domain fault isolation are centralized, decentralized, and distributed [21,38], as described by Katzela et al. [21]. In the centralized approach, a single entity has a global view of the
network  and  performs  inference  on  behalf  of  all  domains  involved.  The  centralized 
approach  introduces  a  single  point  of  failure  for  troubleshooting,  and  is  both  inef¬ 
ficient  and  inflexible  [35].  A  centralized  scheme  as  described  by  Katzela  is  akin  to 
the  full  disclosure  approach  described  in  this  research,  and  is  therefore  not  practical 
for  privacy  reasons.  In  the  decentralized  approach,  a  central  manager  oversees  all 
domain  managers.  When  failures  affect  more  than  one  domain,  this  central  manager 
coordinates  cross-domain  collaboration  between  the  domains  [21].  In  the  distributed 
approach, each network is partitioned into logically autonomous systems. This approach is similar to isolated inference, and includes abstract representations of external root causes that could affect internal observations [21]. A distributed approach
is  well  suited  for  fault  localization  when  effects  from  faults  do  not  propagate  across 
network  domains  [5]. 

Existing  cross-domain  approaches  tend  to  either  rely  on  full  cooperation  from 
involved  domains  [18,35,38],  or  on  passively  monitoring  traffic  and  actively  probing 
to  infer  network  state  [2,33].  In  the  past,  researchers  have  used  routing  and  up¬ 
date  messages,  or  distributed  probing  to  identify  cross-domain  failures  [48].  Some 
suggested  techniques  are  based  on  observing  distributed  traffic  [30,48].  Distributed 
fault  localization  techniques  have  been  identified  as  an  open  research  problem  [37].  A 
solution  to  distributed  fault  localization  in  hierarchically  routed  networks  has  been 
proposed  [38],  which  is  discussed  next. 

Steinder et al. (2008) presented a cross-domain fault localization approach for hierarchically organized networks that use probabilistic fault propagation models [38].
In  this  approach,  a  network  manager  at  the  top  of  a  hierarchy  oversees  and  coordinates 
fault  localization  for  subordinate  domains.  The  approach  relies  on  domains  to  use 
probabilistic fault localization algorithms. In the approach, each domain attempts to localize faults internally first. If the most probable cause of a fault comes from a proxy
node  representing  external  domains,  the  domain  manager  requests  inference  from  the 
network  manager.  The  network  manager,  upon  receiving  a  request  for  inference  from 
a  domain  manager,  determines  the  domains  transited  by  the  path  as  reported  by  the 
requesting  domain.  The  network  manager  divides  the  path  into  nodes  representing  the 
transited  domains  and  the  external  links  between  them,  and  provides  each  domain  the 
status of the end-to-end path as an external observation. Each domain then correlates this external observation with its internal observations and returns a probability that the root cause resides within the domain. If the most probable explanation is one of the domains, that domain manager is responsible for finding the precise root cause.

The  cross-domain  approach  for  hierarchical  networks  does  not  achieve  gener¬ 
ality.  The  approach  relies  on  network  domains  to  fall  into  a  strict  hierarchy,  such 
as  in  either  a  strictly  customer-provider  relationship  or  a  hierarchically  organized  set 
of  domains  under  the  same  authority.  The  approach  explicitly  looks  for  errors  along 
a  path,  meaning  that  the  model  must  contain  all  paths  through  the  domains.  Fi¬ 
nally,  each  domain  must  use  a  probabilistic  fault  localization  algorithm  in  order  to 
collaborate. 

D.  NETWORK  TOMOGRAPHY 

Network tomography, a term first used by Vardi in 1996 [45], uses a limited subset of nodes to monitor a network and to estimate the network status and structure [7,9,22,26]. Network tomography can help to identify routing faults and congestion [9]. However, implementing network tomography on a large scale faces significant computational challenges [7]. Two forms of network tomography in recent literature are path-level traffic intensity estimation (also known as passive tomography) and link-level parameter estimation (also known as active tomography) [9,26].

With path-level parameter estimation, nodes inside of a network collect link-level measurements to estimate path-level parameters [9]. The information can then be used to estimate the traffic matrix of a network [26].

In  link-level  parameter  estimation,  nodes  (typically  on  the  fringe  of  a  network) 
collect  path-level  measurements  to  estimate  link  parameters  [9].  Alternatively,  the 
data  may  be  gathered  via  probes  into  the  network,  hence  active  tomography  [26]. 
These measurements can be used to characterize a network’s performance over time [26].
In  addition  to  probing  to  measure  network  status,  active  tomography  can  be  used  to 
reveal  a  network’s  hidden  structure  [9]. 

With  the  lack  of  viable  cooperative  cross-domain  fault  localization  solutions, 
active  network  tomography  provides  a  non-cooperative  option  for  a  domain  to  probe 
other  network  domains  to  gather  external  evidence.  While  probing  can  certainly  pro¬ 
vide  valuable  information  about  the  state  and  structure  of  another  network  domain, 
the  information  gleaned  will  not  necessarily  be  of  the  same  quality  as  information 
provided  by  a  collaborative  approach.  Furthermore,  there  is  a  growing  network  se¬ 
curity  trend  to  prevent  network  probing  [7,9],  which  may  reduce  the  effectiveness  of 
active  tomography. 

E.  PRIVACY  CONSIDERATIONS 

A  passive,  or  semi-honest,  adversary  will  follow  specified  protocols  and  attempt 
to  infer  as  much  information  as  possible  from  messages  received  [47].  In  the  context 
of  cross-domain  fault  localization  using  graph  digests,  a  passive  adversary  will  only 
attempt  to  learn  sensitive  information  from  a  graph  digest.  Recent  work  in  data- 
mining  uses  a  semi-honest  collaboration  model  [6,28].  As  in  the  data-mining  work, 
this  research  assumes  a  semi-honest  model. 

An  active,  or  malicious,  adversary  will  not  necessarily  follow  the  protocols,  and 
may  take  measures  to  influence  the  dataset  [47].  An  active  adversary  may  intention¬ 
ally  induce  network  faults  for  the  purpose  of  learning  sensitive  network  properties. 
This  research  assumes  that  collaboration  for  finding  faults  is  not  done  with  active 
adversaries.

Graph anonymization is typically done using naive anonymization, in which node labels are simply renamed, creating a graph isomorphic to the original, non-anonymized graph [17]. Recent approaches to strengthen anonymization of graphs include random perturbation of edges: performing random edge deletions and insertions [17]. While perturbation helps to enhance privacy of a graph, these perturbations may degrade the accuracy of the graph [17], and techniques have been proposed to estimate the original data from the perturbed data [20]. Narayanan and
Shmatikov  successfully  identified  individual  Netflix  records,  in  spite  of  small  data 
perturbations  [29].  Any  approach  to  share  inference  information  for  fault  localization 
must  take  measures  beyond  simple  node  anonymization  if  privacy  is  a  concern. 

Recent work to de-anonymize graphs attempts to locate specific nodes in the graph [17]. A common technique to measure privacy for a data set is to measure k-anonymity as defined by Sweeney [43]. The basic idea of k-anonymity is to create sets of indistinguishable nodes. The cardinality of the smallest of these sets equals the k-anonymization level. Future work to augment the generalized standard deviation metric presented in Chapter III with k-anonymization may strengthen the proposed practical privacy protection approach.
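
As a rough illustration of the k-anonymization level described above, the following Python sketch groups nodes by an observable signature and reports the size of the smallest group. Using node degree as the signature is an assumption made only for this example, as are the function name `k_anonymity` and the toy edge list; the dissertation does not prescribe a particular signature.

```python
from collections import Counter


def k_anonymity(signatures):
    """Size of the smallest set of nodes sharing the same signature.

    signatures: mapping node -> hashable signature an adversary could observe.
    """
    if not signatures:
        return 0
    return min(Counter(signatures.values()).values())


# Hypothetical anonymized graph; node degree serves as the signature.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

k = k_anonymity(degree)
print(k)
```

Here nodes c and e each have a unique degree, so the smallest set has one member and the graph is only 1-anonymous with respect to degree.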

Secure  Multiparty  Computation  (SMC)  approaches  address  performing  joint 
computation  in  a  distributed  system  where  each  party  reveals  no  information  other 
than  their  input  and  output  [27].  Although  any  polynomial-time  multi-party  tech¬ 
nique  can  be  performed  with  privacy  preservation  using  SMC,  the  cost  of  performing 
SMC  schemes  for  large-scale  models  can  be  too  high  [6,27,28].  As  inference  models 
for fault localization are large, an SMC approach to fault localization may not be viable. The framework and approach presented in Chapter III, however, may decrease the size of the models involved sufficiently to enable using SMC.

F. CONCLUSION


As  seen  in  this  chapter,  there  has  been  recent  progress  in  fault  localization.  Of 
particular  note,  the  SCORE  (2005)  and  SHRINK  (2005)  algorithms  both  use  bipartite 
causal  graphs.  SCORE  uses  an  approach  that  embodies  the  principle  of  Occam’s 
Razor,  while  SHRINK  uses  Bayesian  inference  with  independence  assumptions.  The 
strengths  of  these  two  algorithms  make  them  excellent  vehicles  to  test  cross-domain 
fault  localization. 

Cross-domain fault localization remains an under-researched area, although there have been hints of progress. The only notable approach found in the literature shows much promise, but is not general in its application.

III. GRAPH DIGEST APPROACH

This  chapter  presents  a  framework  for  cross-domain  fault  localization  that  al¬ 
lows  any  design  in  the  problem  space  to  be  evaluated.  The  chapter  provides  a  set  of 
criteria to explicitly define the two primary requirements of cross-domain fault localization, realization of inference gain and protection of privacy, and the requirement
of  scalability.  The  associated  metrics  for  accuracy  and  scalability  are  relatively  easy 
to  compute  and  make  it  possible  to  experimentally  evaluate  a  design  in  terms  of  these 
criteria.  The  chapter  further  provides  a  specific  approach,  using  graph  digests,  for 
use  with  fault  localization  models  based  on  causal  graphs. 

A.  GENERAL  FRAMEWORK 

As  discussed  in  Chapter  I,  there  are  three  general  approaches  for  diagnos¬ 
ing  cross-domain  problems.  This  chapter  provides  a  first  formulation  of  the  third 
approach,  whereby  domains  exchange  limited  information,  e.g.,  summaries  of  fault 
observations,  to  strike  a  balance  between  inference  gain  and  privacy  preservation. 
The  crux  of  the  formulation  is  a  set  of  general  metrics  to  measure  the  accuracy, 
privacy  protection,  and  scalability  of  a  given  cooperative  design. 

1.  Modeling  Inference  Gain 

A design is useless if the results it produces are not useful for inference. A cooperative design is inference preserving if it maintains enough structure to allow successful inference. Ideally, a design achieves the same inference gain as full disclosure.

This  research  addresses  two  specific  questions  regarding  the  benefits  of  using 
a  proposed  design: 

1. What is the change in inference accuracy by using the design for cross-domain scenarios compared to the accuracy achieved when domains perform inference in isolation?

2. What is the decrease in inference accuracy caused by using the design compared to the accuracy achieved when domains collaborate with fully disclosed information?


Question  1  above  can  be  paraphrased  as  “What  is  gained  by  sharing  informa¬ 
tion  when  troubleshooting  a  problem?”  Question  2  looks  at  the  problem  from  the 
other  direction:  “What  is  lost  by  trying  to  keep  some  things  secret?”  If  the  answer 
to  Question  1  is  “a  lot”  then  the  design  is  effective  at  locating  faults.  If  the  answer  to 
Question  2  is  “not  a  lot”  then  the  design  is  efficient  at  realizing  the  potential  accuracy 
gain  of  cross-domain  fault  localization. 


a.  Accuracy  Metrics 

How is accuracy measured? Consider $n$ domains performing fault localization and let $B_T$ denote the set of actual faults (i.e., the ground truth). Let the best explanation derived by isolated inference be $B_i$ for each domain $i$. Let the best explanation derived by full disclosure and a proposed design be $B_u$ and $B_d$ respectively. First consider the case of isolated inference. Clearly if $(B_T - (\cup_{i=1}^{n} B_i)) \neq \emptyset$, then the isolated inference results contain false negatives (some faults were not found). The hit ratio [23] is denoted by $h_s$ and measures the percentage of correct results in $\cup_{i=1}^{n} B_i$:

$$h_s = \frac{|(\cup_{i=1}^{n} B_i) \cap B_T|}{|\cup_{i=1}^{n} B_i|} \tag{3.1}$$

Likewise if $((\cup_{i=1}^{n} B_i) - B_T) \neq \emptyset$, then the inference results in isolation have false positives. The coverage ratio [23] (denoted: $c_s$) measures the percentage of faults in $B_T$ that are correctly identified by $\cup_{i=1}^{n} B_i$:

$$c_s = \frac{|(\cup_{i=1}^{n} B_i) \cap B_T|}{|B_T|} \tag{3.2}$$

It is clear that $1 \geq h, c \geq 0$. The ratios of false positives and false negatives are $1 - h$ and $1 - c$ respectively, both relative to $B_T$. The ratios $h$ and $c$ can each be easily optimized at the expense of the other, which may be overcome by computing the harmonic mean of the two values. The harmonic mean of precision and recall is also known as the F score [3]. This research proposes to use the harmonic mean $\alpha$ as the criterion to measure how well a digest model preserves inference gain.

The overall accuracy of isolated inference (denoted: $\alpha_s$) is the harmonic mean of $h_s$ and $c_s$:

$$\alpha_s = \begin{cases} 0 & \text{if } h_s = c_s = 0 \\ \dfrac{2 \, h_s \cdot c_s}{h_s + c_s} & \text{otherwise.} \end{cases} \tag{3.3}$$

The value of $\alpha_s$ ranges from 0 (zero accuracy) to 1 (perfect inference). Intuitively, a small $\alpha_s$ value indicates a need for cross-domain coordination.

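The accuracy using full disclosure (undigested graphs) is calculated by

$$\alpha_u = \begin{cases} 0 & \text{if } h_u = c_u = 0 \\ \dfrac{2 \, h_u \cdot c_u}{h_u + c_u} & \text{otherwise,} \end{cases} \tag{3.4}$$

where

$$h_u = \frac{|B_u \cap B_T|}{|B_u|} \quad \text{and} \quad c_u = \frac{|B_u \cap B_T|}{|B_T|}. \tag{3.5}$$

The accuracy of a proposed design is calculated by

$$\alpha_d = \begin{cases} 0 & \text{if } h_d = c_d = 0 \\ \dfrac{2 \, h_d \cdot c_d}{h_d + c_d} & \text{otherwise,} \end{cases} \tag{3.6}$$

where

$$h_d = \frac{|B_d \cap B_T|}{|B_d|} \quad \text{and} \quad c_d = \frac{|B_d \cap B_T|}{|B_T|}. \tag{3.7}$$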


Without special consideration, a failure hypothesis involving $x > 1$ indistinguishable faults will result in adding $x$ faults to the best explanation every time, adversely impacting the hit ratio of the hypothesis. These faults are combined into a single fault to calculate the scores $\alpha_u$, $\alpha_d$, and $\alpha_s$. Consolidating indistinguishable faults is consistent with the SCORE fault localization algorithm [23].

To quantify the inference gain $\Delta$ from using a design (i.e., to answer question 1 above), this research proposes to compute the difference between its inference accuracy and the accuracy achieved by domains in isolation:

$$\Delta = \alpha_d - \alpha_s. \tag{3.8}$$

The value of $\Delta$ ranges from -1.0 to 1.0. A positive score means that the design improved fault localization, a score of 0.0 means there was no improvement, and a negative value means that using the design was worse than isolated inference. For example, suppose $B_T = \{S_1, S_4\}$, $\cup_i B_i = \{S_2, S_5\}$, and $B_d = \{S_1, S_5\}$. Then $h_s = c_s = 0$ and $h_d = c_d = 0.5$. Thus, the inference gain $\Delta$ equals 0.5 for this case.

Similarly, this research proposes to measure the cost to privacy protection $C$ (i.e., to answer question 2 above) with the metric

$$C = \alpha_u - \alpha_d, \tag{3.9}$$

where $C$ ranges from -1.0 to 1.0, with a larger value indicating a higher cost (see footnote 1). Continuing with the example above, $C$ would be 0.5 if the full disclosure approach achieves perfect accuracy, i.e., $B_u = B_T$, which implies $\alpha_u = 1.0$.
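
The accuracy, inference gain, and privacy cost metrics can be checked numerically with a short Python sketch. The function names `hit_and_coverage` and `accuracy` and the variable names are hypothetical; the fault sets are the ones from the running example, with full disclosure assumed to recover the ground truth exactly.

```python
def hit_and_coverage(explanation, truth):
    """Hit ratio and coverage ratio of an explanation against the ground truth."""
    correct = len(explanation & truth)
    hit = correct / len(explanation) if explanation else 0.0
    coverage = correct / len(truth) if truth else 0.0
    return hit, coverage


def accuracy(explanation, truth):
    """Harmonic mean of hit and coverage ratios, defined as 0 when both are 0."""
    h, c = hit_and_coverage(explanation, truth)
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)


truth = {"S1", "S4"}      # ground truth B_T
isolated = {"S2", "S5"}   # union of the isolated explanations B_i
design = {"S1", "S5"}     # explanation from the proposed design B_d
full = set(truth)         # assumption: full disclosure finds B_T exactly

alpha_s = accuracy(isolated, truth)
alpha_d = accuracy(design, truth)
alpha_u = accuracy(full, truth)

gain = alpha_d - alpha_s  # inference gain, Equation 3.8
cost = alpha_u - alpha_d  # cost to privacy protection, Equation 3.9
print(gain, cost)
```

For these sets the sketch gives an inference gain of 0.5 and a cost of 0.5, matching the worked example.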

Note  that  a  design  that  only  shares  limited  information  may  require 
dramatically  less  computation  as  compared  to  the  full  disclosure  approach.  In  other 
words,  the  design  is  much  more  scalable.  Section  3  discusses  how  to  quantify  this 
benefit. 

In  support  of  the  hypothesis  of  this  dissertation,  experiments  showed 
that  a  prototype  design  generally  achieves  better  accuracy  than  isolated  inference. 
To test this hypothesis, the null hypothesis $H_0$ used was: a prototype design is no better than isolated inference. $H_0$ is equivalent to $\alpha_d = \alpha_s$, which is just $\Delta = 0$. The alternative hypothesis $H_1$ was that the design is more accurate than isolated inference, or $\Delta > 0$.

2.  Modeling  Privacy  Preservation 

In developing criteria for privacy preservation this section first presents a metric to measure how much information a design discloses that would otherwise remain undisclosed. Recognizing that this ideal metric may not be practical, Section B presents practical metrics to measure information disclosure from using a specific design (using a graph digest approach).

[Footnote 1: Intuitively, C should range from 0.0 to 1.0.]

Before  measuring  privacy  preservation,  first  the  information  that  needs  to  be 
protected  must  be  established.  This  research  defines  a  sensitive  property  as  a  piece  of 
information  a  domain  manager  considers  private.  Ideally,  shared  information  should 
not  help  to  reveal  any  sensitive  properties.  Information  about  sensitive  properties 
should  never  be  distributed  unless  permitted  by  a  domain’s  local  security  policy. 
Specific  sensitive  properties  will  vary  between  domains  and  may  include  bottlenecks, 
customer information, peering agreements, and many other characteristics. Furthermore, a collection of exchanged information from a domain over time should not aid in deriving the sensitive properties.

Shannon said that “perfect secrecy” is achieved when the a priori probability is equal to the a posteriori probability for message traffic deciphering by an adversary [32]. The same concept applies to sharing inference information. One has to assume that
an  adversary  has  some  domain  knowledge,  has  passive  access  to  externally  observable 
information,  and  can  infer  some  level  of  knowledge  about  a  distribution  over  time. 
As  discussed  in  Chapter  II,  this  research  assumes  a  semi-honest  model. 

This  research  explores  information  theory,  which  overlaps  several  technical 
fields,  to  address  the  privacy  preservation  issue.  Entropy,  typically  measured  in  bits, 
is foundational to information theory. Entropy measures uncertainty about a probability distribution. The entropy $H(X)$ of the random variable $X$ with a probability mass function $p(x)$ is defined by

$$H(X) = -\sum_{x} p(x) \log_2 p(x).$$

Information theory provides a means to reason about entropy between two distributions [11].
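As a small illustration of the entropy definition, the following sketch computes the entropy of a probability mass function given as a list of probabilities. The function name and the example distributions are hypothetical.

```python
import math


def entropy_bits(pmf):
    """Shannon entropy in bits; outcomes with p(x) = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in pmf if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # adversary has no idea which value holds
certain = [1.0, 0.0, 0.0, 0.0]      # adversary knows the value

print(entropy_bits(uniform))
print(entropy_bits(certain))
```

The uniform belief over four values carries two bits of uncertainty, while the certain belief carries none.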

Using an information-theoretic approach, the privacy loss due to implementing a cooperative design is measured by the relative entropy, or Kullback-Leibler (KL) distance [11], between two probability mass functions: one for the random variable representing an adversary’s belief about a sensitive property’s true value without shared information, and one for that belief after receiving shared information. Consider a sensitive property that can be modeled by a discrete random variable $X$. Let $p(\cdot)$ represent the probability mass function representing an adversary’s belief about this sensitive property conditioned by externally available information. Let $d(\cdot)$ represent the probability mass function representing the adversary’s belief further conditioned by shared information. The KL relative entropy equation is

$$KL(p(\cdot)\,\|\,d(\cdot)) = \sum_{x} p(x) \log_2 \frac{p(x)}{d(x)}. \qquad (3.10)$$

In  the  best  case  this  distance  will  equal  zero  for  each  sensitive  property  in 
a  domain,  meaning  that  the  information  about  a  sensitive  property  is  unchanged 
after  sharing  information.  Even  if  the  entropy  is  reduced  for  a  sensitive  property,  the 
entropy  of  d(x)  may  remain  sufficiently  high  to  protect  the  privacy  of  the  property. 
Ultimately, the resultant entropy of d(x), and not the amount of entropy lost as given by Equation 3.10, indicates the level of privacy protection for a sensitive property.
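The relative entropy can be computed directly once both belief distributions are available. The sketch below assumes the standard definition of the KL distance from [11], with the prior belief $p$ as the first argument and the posterior belief $d$ as the second; the example distributions are invented for illustration.

```python
import math


def kl_bits(p, d):
    """KL(p || d) in bits for two pmfs defined over the same outcomes."""
    total = 0.0
    for p_x, d_x in zip(p, d):
        if p_x == 0:
            continue  # the term 0 * log(0 / d) is taken as 0
        if d_x == 0:
            return math.inf  # p puts mass where d puts none
        total += p_x * math.log2(p_x / d_x)
    return total


prior = [0.25, 0.25, 0.25, 0.25]
posterior_unchanged = [0.25, 0.25, 0.25, 0.25]
posterior_sharper = [0.5, 0.25, 0.125, 0.125]

print(kl_bits(prior, posterior_unchanged))
print(kl_bits(prior, posterior_sharper))
```

A result of zero corresponds to the best case described above, in which sharing leaves the adversary's belief unchanged.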

If  prior  and  posterior  probability  distributions  modeling  an  adversary’s  belief 
about  a  sensitive  property  can  be  derived,  the  relative  entropy  Eq.  (3.10)  can  be  used 
to  evaluate  the  privacy  protection  for  a  property.  Although  the  KL  distance  appears 
to  be  a  perfect  measure  of  privacy  preservation,  it  is  extremely  difficult  to  apply  in 
practice  as  computing  the  KL  distance  requires  knowledge  of  the  prior  probability 
distribution  the  adversary  uses  (explicitly  or  implicitly)  to  guess  a  secret. 

3.  Modeling  Scalability 

Consider the SHRINK algorithm, which achieves polynomial time inference by assuming no more than 3 concurrent SRG failures [19]. The algorithm still must consider $\binom{n}{1} + \binom{n}{2} + \binom{n}{3}$ hypotheses,² with $n$ here denoting the total number of SRGs. The computational complexity for SHRINK is $O(n^4)$. Clearly, by compressing information, a cooperative design will reduce the number of elements in a model (e.g., the number of SRGs $n$ in the SHRINK model), resulting in far fewer hypotheses to consider vs. full disclosure. As a result, such a design is intuitively more scalable in terms of inference running time.

² After abstracting away the null hypothesis and the “not in the model” hypothesis
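To give a feel for how quickly the hypothesis space shrinks with $n$, the following sketch counts the hypotheses with at most three concurrent SRG failures. The SRG counts used in the example are arbitrary illustrative values, not figures from the experiments.

```python
from math import comb


def hypothesis_count(n, max_failures=3):
    """Number of failure hypotheses involving 1..max_failures of n SRGs."""
    return sum(comb(n, k) for k in range(1, max_failures + 1))


full_model = hypothesis_count(200)   # hypothetical full-disclosure model
digest_model = hypothesis_count(20)  # hypothetical digest-based model
print(full_model, digest_model)
```

Because the count grows roughly with the cube of $n$, a tenfold reduction in SRGs removes about three orders of magnitude of hypotheses.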

This  research  proposes  a  direct  measurement  of  inference  running  times  to 
evaluate scalability. Let $t_u$ and $t_d$ represent the recorded average running times for the full-disclosure approach and a proposed design, respectively. The metric $E$ to quantify the scalability improvement is defined by

$$E = \log_{10}\left(\frac{t_u}{t_d}\right). \qquad (3.11)$$

Thus,  E  measures  the  order  of  magnitude  of  reduction  in  inference  time  gained  by 
using  a  cooperative  design  as  compared  to  full  disclosure.  A  logarithmic  measurement 
is  used  for  E  to  clearly  present  the  order  of  magnitude  difference  in  the  running  time. 
A  value  for  E  much  greater  than  0  reflects  significant  savings  in  inference  time  by 
using  the  proposed  design,  a  value  close  to  0  reflects  little  or  no  savings,  and  a  value 
less  than  0  means  that  the  design  performed  slower  than  full  disclosure  inference. 
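The scalability metric reduces to a one-line computation. In the sketch below the running times are placeholder values chosen only to show the three cases just described; the function name is hypothetical.

```python
import math


def scalability_gain(t_full, t_design):
    """E = log10(t_u / t_d); both times must be positive and in the same unit."""
    if t_full <= 0 or t_design <= 0:
        raise ValueError("running times must be positive")
    return math.log10(t_full / t_design)


print(scalability_gain(100.0, 1.0))  # design is two orders of magnitude faster
print(scalability_gain(5.0, 5.0))    # no difference
print(scalability_gain(1.0, 10.0))   # design is slower than full disclosure
```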

B.  GRAPH  DIGEST  APPROACH 

As discussed in Chapter II, recent intra-domain approaches use graphical models to represent dependencies in a network, particularly the causal relationships between hardware failures and observed anomalies. These models (also called inference graphs) enable inference algorithms to determine those failure scenarios best explaining observed anomalies. In practice, faults often propagate across network domain boundaries, depriving intra-domain algorithms of critical information required for accurate inference. This research addresses the problem by sharing summarized intra-domain models, called graph digests or simply digests in this research, between domains. A graph digest is created to reflect a failure scenario and captures cross-domain dependencies while hiding internal details.
A cross-domain inference model based on graph digests can be formally defined as follows. Consider $n$ network domains:

• $G_i$ is the inference graph for the $i$th domain.

• $f$ is (ideally) a one-way transformation on $G_i$ implementing a privacy policy. $f(G_i)$ is called the inference graph digest, or simply digest, for $G_i$.

• $\mathcal{G}_j = \bigl(\biguplus_{i \neq j} f(G_i)\bigr) \uplus G_j$, where $j$ is a domain performing cross-domain inference and $\uplus$ is a model-specific union. $\mathcal{G}_j$ is the cross-domain model integrating the digests from all the other domains with domain $j$’s undigested graph. Now, domain $j$ may use an existing algorithm such as SHRINK to perform inference over $\mathcal{G}_j$.


Before  a  practical  graph  digest  design  can  be  implemented,  interoperability 
standards must be developed. Domains using different inference methods can potentially use a digest approach if standards are implemented and adhered to. Items to be
standardized  include  data  types  and  attributes  as  well  as  cross-domain  management 
structures  such  as  centralized,  distributed,  iterative,  etc.  Translation  procedures  are 
needed  in  order  to  convert  between  models.  This  research  defines  a  shared  attribute 
as a physical entity or logical concept modeled in two or more fault propagation inference graphs, and that has the same semantics in each graph. Shared attributes serve
as  the  glue  that  allows  different  models  to  be  joined.  For  example,  a  shared  attribute 
that  models  the  event  that  packets  flow  across  a  peering  link  between  two  domains 
may  be  modeled  as  a  root  cause  in  one  domain’s  model  and  a  dependent  observation 
in  the  other.  In  order  to  create  a  domain  digest  to  connect  to  another  domain’s  fault 
propagation  inference  graph,  shared  attributes  must  be  identified  and  agreed  upon. 

The process for creating a graph digest is outlined in Figure 3.1. When a fault is detected, if isolated inference does not find the root cause, participating domains agree on which domain will perform inference using its undigested inference graph (graph $G_j$ above). Each other domain creates a digest, if required, using the process in Figure 3.1. The final decision in the graph digest creation process places responsibility on the domain creating the digest to ensure compliance with local security policy. This verification step places a human in the loop to enforce privacy of properties deemed sensitive by the network administrator.

Figure 3.1. Process to create a graph digest

1.  Practical  Privacy  Protection  Metrics 

The  KL  Distance  metric  for  privacy  is  applicable  when  realistic  distributions 
representing an adversary’s belief about a sensitive property can be constructed. Unfortunately, deriving accurate probability mass functions about a sensitive property
in  a  domain,  particularly  from  an  adversary’s  perspective,  may  not  be  possible.  This 
research  explores  a  more  pragmatic  approach:  characterize  the  effectiveness  of  various 
attacks  against  a  digest  to  learn  specific  sensitive  properties  about  the  digest’s  source 
domain.  Specifically,  the  research  provides  a  systematic  method  for  experimentally 
evaluating  attacks  against  a  causal  graph. 

a.  Modeling  a  Causal  Graph  Attack 

The focus of this research is on developing a general evaluation methodology, not on developing the most effective attacks on causal graphs. An exploration
of  different  sensitive  properties  to  demonstrate  privacy  using  the  KL  Distance  did  not 
find  a  sensitive  property  with  a  meaningful  probability  distribution  function. 

As  a  practical  approach,  this  research  explored  learning  a  domain’s 
topology (routers, switches, physical and VPN links, etc.) from the network’s inference graph. Not surprisingly, the literature search did not uncover previous work
addressing  attacks  on  causal  graphs.  Once  portions  of  a  network  topology  have  been 
learned,  sensitive  property  measurements  can  be  taken  on  the  constructed  topology. 
b. Modeling Attack Effectiveness

There  are  many  properties  that  a  network  domain  administrator  may 
consider  sensitive.  However,  relatively  few  of  these  can  be  inferred  from  an  inference 
graph for the domain. For example, properties such as detailed customer information or operating system details are most likely abstracted away from the graph. Four
example  sensitive  properties  that  could  be  inferred  from  an  inference  graph  and  used 
to  evaluate  privacy  protection  are: 

•  Domain  network  diameter 

•  Number  of  routers  in  a  domain 

•  Degree  of  the  node  with  the  highest  degree  in  a  domain 

•  Internal  reachability  between  a  pair  of  visible  gateways 

This  research  proposes  to  use  the  following  statistical  metrics  to  model 
the  effectiveness  of  an  attack  against  a  sensitive  property. 


• Root mean square error (rMSE).

Let $X = \{x_1, x_2, \ldots, x_m\}$ represent the collection of samples for a set of $m$ scenarios where the property has a fixed true value of $P$. The rMSE for that scenario set is defined by

$$rMSE = \sqrt{E\bigl((X - P)^2\bigr)} = \sqrt{\sum_{i=1}^{m} (x_i - P)^2 / m}. \qquad (3.12)$$

The interpretation of rMSE is straightforward: if the rMSE value is large relative to the true value $P$, the attack is considered unsuccessful.

• Generalized standard deviation (gSTD).

Usually the standard deviation, like rMSE, should be defined with respect to a set of scenarios where the property’s true value is fixed. The definition is generalized to consider samples from all scenarios used in an evaluation. Let $X = \{x_1, x_2, \ldots, x_M\}$ represent the collection of samples for all $M$ scenarios. The gSTD is computed like a usual standard deviation by

$$gSTD = \sqrt{E\bigl((X - E(X))^2\bigr)}. \qquad (3.13)$$

The gSTD has a desirable feature: it captures how well the attack algorithm tracks the fluctuation in the true value of the property. This point will be further articulated in Chapter VI. The attack is considered not effective if gSTD is small relative to the sample mean $E(X)$. For this reason, gSTD can be viewed as a good indicator of the KL distance.
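Both metrics are simple to compute from attack output. The following sketch mirrors Equations 3.12 and 3.13; the sample values stand in for an attacker's estimates of a property such as the number of routers and are invented for illustration.

```python
import math


def rmse(samples, true_value):
    """Root mean square error of estimates against a fixed true value P."""
    return math.sqrt(sum((x - true_value) ** 2 for x in samples) / len(samples))


def gstd(samples):
    """Population standard deviation over samples pooled from all scenarios."""
    mean = sum(samples) / len(samples)
    return math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))


# Hypothetical attacker estimates for scenarios whose true value is 18.
estimates_fixed_truth = [16, 20]
# Hypothetical estimates pooled over scenarios with differing true values.
pooled_estimates = [2, 4, 4, 4, 5, 5, 7, 9]

print(rmse(estimates_fixed_truth, 18))
print(gstd(pooled_estimates))
```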


C.  CONCLUSION 

This  chapter  covers  the  two  most  significant  contributions  of  the  research: 
first,  a  general  framework,  and  second,  a  graph  digest  approach. 

The general framework provides solid metrics to evaluate any cooperative design in the design spectrum defined by the competing requirements of accuracy, privacy, and scalability. These metrics provide a transparent and rigorous way to address the core issues for cross-domain fault localization and evaluate any such design.

In  the  graph  digest  approach,  domains  provide  abstract  representations  of 
their  fault  propagation  models,  coupled  with  evidence,  to  strike  a  balance  between 
accuracy  and  privacy.  By  distilling  the  shared  information  down  to  a  collection  of 
nodes  and  edges,  the  approach  intuitively  reduces  the  size  of  a  model  as  compared  to 
the  full  disclosure  approach. 
IV. EVALUATION METHODOLOGY


This chapter presents the methodology used to evaluate the framework presented in Chapter III. A prototype algorithm for creating graph digests from bipartite causal graphs was developed for evaluation. This algorithm is used to construct digests for numerous failure scenarios across a range of realistic network topologies, and is then evaluated in terms of inference accuracy, privacy preservation, and scalability as defined in Chapter III. The evaluation was shaped to consider scenarios with inherent cross-domain characteristics.

The test topology selection process was motivated by several factors. Most importantly, to satisfy the generality of the approach, the test topologies represent a wide range of realistic networks. It is well-known that there is a lack of topologies available for research efforts [34]. Instrumented networks may not contain enough variety in topology to represent networks in the general case, and almost certainly would not yield reproducible results. Furthermore, as this research is about fault localization across network domains, collected data from instrumented networks would not necessarily contain sufficient numbers of samples for meaningful results. The Naval Postgraduate School is not a member of either the Abilene or GEANT projects, and any collected cross-domain data from these projects, if it exists, is not publicly available. Given the need for generality and the lack of suitable test topologies, topologies containing realistic topology components (atoms) were constructed. To provide empirical evidence of the applicability to real network topologies, an additional topology based on the Abilene [1] network was used.

To further evaluate the generality of the approach, experiments were conducted by performing inference with both the SHRINK and SCORE algorithms using six different synthetic topologies: two types of relationship between domains, each with three topologies (small, medium, and large). Both provider-customer and peer-peer relationships were modeled since autonomous systems typically peer using one of these two relationships [42], and so the generality of shared attributes could be demonstrated. Additional experiments were performed using a topology
based  on  the  Abilene  network  in  a  provider-customer  setting.  For  each  topology, 
data  was  collected  for  single  and  double  failure  scenarios.  If  an  observation  node  could 
exist  in  another  domain  that  provides  evidence  about  an  SRG,  this  research  defines 
that  SRG  as  a  cross-domain  SRG.  Failure  scenarios  were  generated  randomly,  but 
in  order  to  favor  scenarios  requiring  cross-domain  fault  localization,  failure  selection 
was  constrained  such  that  at  least  one  failure  in  a  scenario  must  be  a  cross-domain 
SRG.  All  single  failure  scenarios  that  satisfy  this  constraint  were  evaluated.  There 
are a total of 9 such scenarios in each provider-customer setting. In the peer-peer setting there are 24, 42, and 68 such failures in the small, medium, and large topologies, respectively. Three data collection cycles of fifty failure scenarios each for the double
failure  scenarios  were  executed,  yielding  150  distinct  double  failure  scenarios  for  each 
of  the  small,  medium,  and  Abilene-based  topologies.  For  both  of  the  large  topologies, 
two  collection  cycles  of  twenty-five  failure  scenarios  were  executed,  resulting  in  fifty 
distinct  double  failure  scenarios. 
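The constrained scenario generation described above can be sketched as follows. This is an illustrative reconstruction, not the generator used in the experiments: it assumes scenarios are drawn uniformly from all SRG pairs that contain at least one cross-domain SRG, and the SRG names in the example are placeholders.

```python
import itertools
import random


def double_failure_scenarios(all_srgs, cross_domain_srgs, count, seed=None):
    """Draw `count` distinct two-SRG scenarios with >= 1 cross-domain SRG each."""
    cross = set(cross_domain_srgs)
    # Enumerate every pair that satisfies the cross-domain constraint.
    valid = [
        frozenset(pair)
        for pair in itertools.combinations(sorted(all_srgs), 2)
        if cross & set(pair)
    ]
    if count > len(valid):
        raise ValueError("not enough distinct scenarios satisfy the constraint")
    # Sampling without replacement guarantees distinct scenarios.
    return random.Random(seed).sample(valid, count)


all_srgs = ["O1", "O2", "F1", "F2", "F3", "F4", "R1", "R2", "R3", "R4"]
cross_domain = {"O1", "O2", "F1", "F2", "F3", "F4", "R4"}
scenarios = double_failure_scenarios(all_srgs, cross_domain, count=5, seed=1)
for scenario in scenarios:
    print(sorted(scenario))
```

Enumerating the valid pairs first keeps the sketch simple; for very large SRG sets, rejection sampling would avoid materializing every pair.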

A variation of the decentralized collaboration model (Chapter II) was selected for digest exchange. In the implemented model, each participating domain passed a digest to a single domain, which then performed inference on behalf of all of the participating domains. The model intuitively provides the best choice for the provider-customer domain relationship. With many customers, the provider network domain has shared attributes with each customer, while customers may not have shared attributes with each other. In each failure scenario the domain identified as Domain 1 performed inference, adding a digest from Domain 2 to the Domain 1 causal graph.

Detailed  descriptions  of  the  models  used  with  early  versions  of  the  prototype 
algorithm  and  attack  heuristic  are  documented  in  previously  published  incremental 
steps  of  this  research  [15,16]. 
createBipartiteDigest(G)

Add node L_new to G
for all SRG S_i ∈ G
    if (for all edges (S_i, L_j) ∈ G, L_j is up)
        then Prune S_i and its edges (S_i, L_j)
    else
        Collect edges (S_i, L_j) ∈ G such that L_j is up
        if At least one such edge exists
            Add edge (S_i, L_new)
            Prune collected edges (S_i, L_j)
Remove all isolated observation nodes L_j
for all SRG S_x, S_y ∈ G
    if S_x and S_y are indistinguishable
        Aggregate S_x and S_y into S'_x such that S'_x = S_x ∪ S_y
Rename all SRGs that are not shared attributes
Rename all Observation nodes other than L_new

Figure 4.1. Algorithm for computing a digest from a bipartite causal graph G.
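The pseudocode in Figure 4.1 can be rendered as a short Python sketch. Several details are assumptions made for illustration and are not specified by the figure: the graph is represented as a mapping from each SRG to its set of observation nodes, two SRGs are treated as indistinguishable when they have identical remaining edge sets, shared attributes are never aggregated, and anonymized names take the form S1, S2, ... and L1, L2, .... The function and parameter names are hypothetical, and the example graph is a toy, not the causal graph of any test topology.

```python
def create_bipartite_digest(edges, obs_up, shared_attributes, new_node="L_new"):
    """Build a digest from a bipartite causal graph.

    edges: dict mapping each SRG to the set of observation nodes it affects.
    obs_up: dict mapping each observation node to True (up) or False (down).
    shared_attributes: SRG names that must keep their agreed identity.
    """
    kept = {}
    for srg, observations in edges.items():
        up = {o for o in observations if obs_up[o]}
        down = set(observations) - up
        if not down:
            continue  # every neighbor is up: prune the SRG and its edges
        if up:
            down.add(new_node)  # replace all "up" edges by one edge to L_new
        kept[srg] = down
    # Observation nodes left without edges simply never appear in `kept`.

    # Aggregate SRGs with identical edge sets (assumed meaning of
    # "indistinguishable"); shared attributes are kept separate.
    groups = []
    for srg in sorted(kept):
        key = (frozenset(kept[srg]), srg if srg in shared_attributes else None)
        if key not in groups:
            groups.append(key)

    # Rename non-shared SRGs and all observation nodes except L_new.
    obs_names = {new_node: new_node}
    digest = {}
    anonymous = 0
    for observations, tag in groups:
        if tag is None:
            anonymous += 1
            name = f"S{anonymous}"
        else:
            name = tag
        for obs in sorted(observations):
            obs_names.setdefault(obs, f"L{len(obs_names)}")
        digest[name] = {obs_names[obs] for obs in observations}
    return digest


# Toy graph: three routers, three shared attributes, one unrelated router.
edges = {
    "R4": {"L4-11", "L4-12"},
    "R11": {"L4-11", "L11-12"},
    "R12": {"L4-12", "L11-12"},
    "A1": {"L4-11"},
    "A2": {"L4-12"},
    "A3": {"L11-12"},
    "R1": {"L1-2"},
}
obs_up = {"L4-11": False, "L11-12": False, "L4-12": True, "L1-2": True}
digest = create_bipartite_digest(edges, obs_up, {"A1", "A2", "A3"})
print(digest)
```

In the resulting digest the router names and link names no longer appear, while the shared attributes that still have "down" evidence keep their agreed names.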

A.  DIGEST  ALGORITHM 

As the target of evaluation, a prototype digest creation algorithm was developed. The algorithm uses simple techniques, such as node and edge pruning, partial evaluation, aggregation, and node renaming. Information such as prior probabilities and conditional probabilities is anonymized by setting all respective values to the same strength. The algorithm originally used Noisy-OR to combine edges in the digest causal graph directed to “up” observation nodes. Using Noisy-OR helped to preserve inference information about these observation nodes that are conditionally dependent on the SRG being evaluated. However, this practice was found to be very revealing about a domain network topology since the edge strength indicated the number of neighbors a device has. Therefore, Noisy-OR was replaced with logical OR in the implementation of the algorithm. See Figure 4.1 for detailed pseudo code of the digest creation algorithm.
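The leak attributed to Noisy-OR is easy to reproduce. The sketch below uses the standard Noisy-OR combination, $1 - \prod_i (1 - q_i)$, and assumes every combined edge carries the same known strength $q$; the exact parameterization in the original prototype is not given here, so the numbers are purely illustrative.

```python
import math


def noisy_or(strengths):
    """Combine independent edge strengths with Noisy-OR."""
    remaining = 1.0
    for q in strengths:
        remaining *= 1.0 - q
    return 1.0 - remaining


def edges_revealed(combined, q):
    """Invert Noisy-OR to recover how many edges of strength q were merged."""
    return round(math.log(1.0 - combined) / math.log(1.0 - q))


q = 0.5  # assumed uniform per-edge strength
for neighbors in (1, 2, 3):
    combined = noisy_or([q] * neighbors)
    print(neighbors, combined, edges_revealed(combined, q))
```

With logical OR the merged edge has the same value regardless of how many edges were combined, so this inversion is no longer possible.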

The focus of this work is on the evaluation methodology and hence the development of potentially more effective digest creation algorithms is left to future work. If a rather simplistic digest creation algorithm performed promisingly, however, it would be reasonable to conclude positively about the feasibility of cross-domain fault localization when more polished techniques are used.

B.  PROVIDER-CUSTOMER  TEST  TOPOLOGIES 

This section introduces the provider-customer test topologies. In a provider-customer relationship, one domain (the provider) provisions network backbone connectivity to a second domain (the customer). In many cases, the provider’s physical topology (e.g., SONET connections multiplexed on fiber) is not observable by the customer. The customer only sees IP connections entering the edge device on one side of the provider’s cloud and exiting on the other. Sometimes only a core router is visible. In any case, many sources of faults are not visible to the customer. Furthermore, configuration problems on either the customer’s or the provider’s side may result in faults that are not readily observable by both parties.

For the provider-customer topologies, it is assumed that failures in the provider network are not total and that the provider does not instrument individual IP flows for fault detection. Accordingly, observations about customer flows were denied to the provider.

1.  Physical  Topologies 

Each topology simulates a provider-customer network setting in which a customer transits the provider domain using three leased circuits. The small topology, depicted in Figure 4.2, is loosely based on the topology used by the authors of SHRINK [19]. The provider network (Domain 1 in Figure 4.2) consists of Optical Digital Cross Connect switches and fiber links to transit customer traffic. The customer in the evaluation leases three optical circuits that transit the fiber mesh as depicted in Figure 4.3. Additionally, several VPN tunnels are modeled in the customer topology, shown in Figure 4.4, to explore the effects this realistic networking practice has on the framework. The study focuses on finding cross-domain faults that occur between the provider domain and one of its customers (Domain 2 in Figure 4.2).

Figure 4.2. Provider-Customer Small Physical Topology.

Identifiers: O* ODCX; F* Optical Fiber; C* Leased Optical Circuit

Figure 4.3. Provider-Customer View of Service Provisioning.

Figure 4.4. Provider-Customer VPN Overlay.

The customer topology was grown on each side, adding sub-components that reflect realistic network topology elements, to create the medium (Figure 4.5) and large (Figure 4.6) network topologies. The small, medium, and large customer network domains have 18, 54, and 204 routers, respectively. The sub-components in the expanded networks have varying properties, such as node degree and distance between elements in the domain, and are connected in mesh, star, ring, and ad hoc topologies. The medium and large network topologies use the same provisioning (Figure 4.3) and customer VPN tunnels (Figure 4.4) as in the small topology.

Figure 4.5. Provider-Customer Medium Physical Topology.

Figure 4.6. Provider-Customer Large Physical Topology.

2.  Modeling 

To  illustrate  how  a  causal  graph  models  fault  propagation  and  how  graph 
digests  are  used,  a  detailed  description  of  inference  using  the  graph  digest  approach 
is  provided  next  using  the  small  provider-customer  physical  topology  (Figure  4.2). 

As illustrated in Figure 4.2, Domain 1 has two optical cross connect switches ($O_1$ and $O_2$) and four fiber links ($F_1 \ldots F_4$) as SRGs. In Domain 2 (the customer domain) each router ($R_1 \ldots R_{18}$) and point-to-point link between adjacent routers (e.g., $R_1 - R_3$) are modeled as an SRG. Every SRG failure in the customer domain generates observations about the failure. The following observation nodes are modeled in Domain 2: the IP connections between each pair of adjacent routers; the 3 internal VPN tunnels ($R_2 - R_3$), ($R_3 - R_6$), and ($R_{15} - R_{17}$); the cross-domain IP connections ($R_4 - R_{11}$), ($R_4 - R_{12}$), and ($R_{11} - R_{12}$); and the cross-domain VPN tunnel ($R_3 - R_{15}$). The three leased circuits underlying the cross-domain IP links serve as the shared attributes for this setting, with Domain 1 modeling the shared attributes as observation nodes ($A_1 \ldots A_3$ in Figure 4.7), and Domain 2 modeling them as SRG nodes. There are nine cross-domain SRGs from both domains ($O_1$, $O_2$, $F_1 \ldots F_4$, $R_4$, $R_{11}$, and $R_{12}$) in the customer-provider setting.

Figure 4.7. Domain 1 causal graph with respect to Domain 2.

Figure 4.8. Domain 2 causal graph for small topology.

As the provider domain (Domain 1) would have many cross-domain SRGs for different customers and a contractual obligation to provide transit service, the provider domain was selected to perform inference on behalf of its customers for the graph digest approach. For each of the failure scenarios, the customer domain generated a digest for inference by the provider domain. The Domain 2 small topology causal graph is presented in Figure 4.8. The routers $R_1 \ldots R_{18}$, the point-to-point links $P_{x-y}$ where $x$ and $y$ are the pair of adjacent routers $R_x$ and $R_y$, and the shared attributes $A_1 \ldots A_3$ are identified as the SRGs. The observation nodes are the IP links between the routers $L_{x-y}$ and VPN tunnels $V_{x-y}$, where $x$ and $y$ designate the routers on either end of the links or tunnels.

The Domain 2 digest created after observing connection failures $L_{4-11}$ and $L_{11-12}$ is depicted in Figure 4.9. The SRGs $R_4$, $R_{11}$, and $R_{12}$ have been anonymized as $S_1 \ldots S_3$, and IP links $L_{4-11}$ and $L_{11-12}$ as $L_1$ and $L_2$. Only the special observation node $L_{up}$ observes an “up” state and all other observation nodes report a “down” state. All SRG prior probabilities are set to a uniform value; likewise all conditional dependencies (the edges) have a uniform value.

An ad hoc node collapsing methodology is used to form a union between the causal graphs, which starts by merging the shared attributes from each causal graph. Next, each observation node inherits all conditional dependencies from all shared attributes on which the observation node is dependent (e.g., if an edge exists from a shared attribute to an observation node, then all edges into the shared attribute from an SRG are copied to that observation node). Finally, the shared attribute nodes are removed. As an example, $F_1$ has an edge to $A_1$ in the Domain 1 causal graph (Figure 4.7) and $A_1$ has an edge to $L_1$ in the Domain 2 digest (Figure 4.9), thus $F_1$ gains an edge to $L_1$ in the causal graph union. The model-specific union of the Domain 1 causal graph with Domain 2’s digest is shown in Figure 4.10. In this sample scenario SHRINK and SCORE each return $F_3$ as the best explanation for both the full disclosure and graph digest approaches.
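The node collapsing steps can be sketched in a few lines. The representation (a mapping from each source node to the set of nodes it points to) and the function name are assumptions for illustration, and the edges in the example are a toy configuration rather than the actual graphs of Figures 4.7 and 4.9.

```python
def collapse_union(local_graph, digest, shared_attributes):
    """Join an undigested graph with a digest through their shared attributes.

    local_graph: SRG -> observation nodes; shared attributes appear as
                 observation nodes here.
    digest:      SRG -> observation nodes; shared attributes appear as SRGs.
    """
    union = {}

    def add_edge(src, dst):
        union.setdefault(src, set()).add(dst)

    # Record which local SRGs point into each shared attribute.
    parents = {attr: set() for attr in shared_attributes}
    for srg, observations in local_graph.items():
        for obs in observations:
            if obs in parents:
                parents[obs].add(srg)
            else:
                add_edge(srg, obs)  # ordinary local edge, kept unchanged

    for srg, observations in digest.items():
        for obs in observations:
            if srg in parents:
                # Copy every edge into the shared attribute onto the
                # observation node that depends on it.
                for parent in parents[srg]:
                    add_edge(parent, obs)
            else:
                add_edge(srg, obs)  # ordinary digest edge, kept unchanged
    return union  # shared attribute nodes no longer appear


local_graph = {"F1": {"A1"}, "F3": {"A1", "A3"}, "O2": {"A3"}}
digest = {"A1": {"L1"}, "A3": {"L2"}, "S1": {"L1", "L2"}, "S2": {"L2", "L_new"}}
union = collapse_union(local_graph, digest, {"A1", "A3"})
print(union)
```

In this toy case F1 gains an edge to L1 and F3 gains edges to both L1 and L2, mirroring the inheritance rule described in the text.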

There may be information loss with the transitive method of inheriting conditional dependence described above. Consider the observation node $L_{4-11}$ in Figure 4.8. This node is conditionally dependent on three SRGs in the Domain 2 graph. The same node, represented by $L_1$ in Figure 4.10, is conditionally dependent on seven SRGs after creating the union of the Domain 2 digest and the undigested Domain 1 causal graph. Depending on the failure scenario, a different number of SRGs may populate the conditional probability table (CPT) for observation node $L_{4-11}$. Since the prior probabilities of the SRGs and the conditional dependencies have been set uniformly in this example scenario, the values in the table are distorted and information is lost. If the correct distributions were learned over time, or intentionally disclosed, the CPTs for the shared attributes could be populated with the correct distributions.

In the provider-customer topologies (Figures 4.2, 4.5, 4.6) $F_1$, $F_2$, and $O_1$ are
indistinguishable  to  SHRINK  and  SCORE.  Statistically,  |  of  the  randomly  generated 
failure  scenarios  will  contain  the  failure  of  one  of  these  three  components.  As  discussed 
in Chapter III, these nodes are combined into a single SRG to calculate the $\alpha_u$, $\alpha_d$, and $\alpha_s$ scores.

C.  PEER-PEER  TEST  TOPOLOGIES 

The  second  class  of  topologies  considers  two  network  domains  with  a  peer-peer 
relationship.  In  this  relationship,  each  domain  provides  connectivity  to  its  customers, 
and neither domain provides Internet connectivity to the other [42]. The two
domains share multiple peering points and web service connections. These web service
connections  represent  monitored  IP  connections  of  interest  between  pairs  of  servers 
hosted  in  different  domains.  Ownership  of  the  shared  links  and  hosting  of  the  services 
may  be  equally  distributed  between  the  two  domains.  IP  link  and  web  service  failures 
are  fully  visible,  and  device  failures  are  considered  total  -  an  SRG  failure  causes  an 
observable  failure  event. 

1.  Physical  Topologies 

A process similar to the one used for the provider-customer topologies was used to create
the peer-peer topologies, incorporating realistic network domain subcomponents. The
small  physical  topology  is  presented  in  Figure  4.11.  The  peer-peer  domains  in  the 
small  topology  have  two  peering  points  (A4  —  All)  and  (A6  —  A17).  The  shared 
attributes A1 and A2 model the event that the cross-domain connection is live.

There are five web services, W1 ... W5, with cross-domain dependencies as
shown in Figure 4.12. For each web service with a cross-domain dependency, one
domain models ownership of the web service and the other domain models a dependency
on that service. The shared attributes A3 ... A6 model the event that the servers can
reach each other along the shortest path for each pair of dependent services.

Figure 4.11. Peer-Peer Domains Small Physical Topology.

Figure 4.12. Peer-Peer Domains Small Services View.

Figure 4.13. Peer-Peer Domains Medium Physical Topology.

The  medium  topology  (Figure  4.13)  is  modeled  with  four  peering  points  and 
eight  web  service  connections  (Figure  4.14),  and  the  large  topology  (Figure  4.15)  with 
eight  peering  points  and  sixteen  web  service  connections  (Figure  4.16).  The  treatment 
of  the  peering  points,  cross-domain  web  services,  and  shared  attributes  follows  the 
same  reasoning  and  implementation  as  discussed  for  the  small  domain  topology. 

2.  Modeling 

The SRG and observation nodes are modeled as in the provider-customer
setting, and use the same notation. The set of cross-domain SRGs, from which each
failure  scenario  must  have  a  failed  component,  contains  every  peering  point  router 
and  link,  and  every  router  and  link  on  the  shortest  path  between  the  servers  for  each 
web  service  dependency.  A  total  of  24,  42,  and  68  cross-domain  SRGs  were  identified 
in  the  small,  medium,  and  large  topologies,  respectively. 

There are two types of shared attributes for this scenario. The first type
represents the peering links between the domains (A1 and A2 in Figure 4.11). For
each such link, one domain owns the link, and this domain models it as an observation
node; the other domain models it as an SRG node. The second type of shared attribute
describes whether a pair of servers can connect with each other. For each web service,
the domain hosting the service models the shared attribute as an observation node
while the domain on the client side models it as an SRG. In the evaluated peer-peer
topologies, both domains observe the state of an event modeled by a shared attribute.

Figure 4.14. Peer-Peer Domains Medium Services View.

Figure 4.15. Peer-Peer Domains Large Physical Topology.

Figure 4.16. Peer-Peer Domains Large Services View.

The construction of causal graphs for the peer-peer domain setting proceeds
in the same way as for the provider-customer setting.

D.  ABILENE-BASED  TOPOLOGY 

To  provide  an  empirical  evaluation  using  an  established  network  topology,  a 
topology  was  constructed  based  on  the  Abilene  network  backbone  [1]  (Figure  4.17). 
Customer domain connections to the Abilene network are modeled as stub routers. A
notional second network (Figure 4.18) was added across the Midwest United States
to decrease the diameter of the Abilene-based network. This second network serves
as a provider to the Abilene network. The domain relationship and causal
graph construction for the Abilene-based network are the same as for the
provider-customer relationship described in Section B above, with the Abilene-based
network filling the role of the customer. The Abilene-based network transits three
circuits, C1 ... C3, provided by the added network. The gateway routers that connect
to the circuits are routers R4, R7, and R20 in Figure 4.17.

Figure 4.17. Abilene-based topology.

Figure 4.18. Provider domain to the Abilene network.

E.  INFERENCE  ALGORITHMS 

Two recent fault localization algorithms, SHRINK [19] and SCORE [23], were
selected (as discussed in Chapter II) to evaluate the graph digest design. The use
of  two  different  algorithms  is  intended  to  provide  evidence  for  the  generality  of  the 
approach.  Although  neither  algorithm  was  crafted  to  perform  cross-domain  fault 
localization,  graph  digests  can  be  passed  across  domain  boundaries  for  inclusion  in  a 
consolidated  inference  effort.  The  two  algorithms  are  fundamentally  different,  with 
SHRINK  using  a  Bayesian  approach  and  SCORE  using  a  greedy  minimum  set  cover 
approach. 
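
To make the contrast with the Bayesian approach concrete, the following is a minimal sketch of a generic greedy set cover over a bipartite causal graph. It is an illustration only and not the published SCORE algorithm, which includes refinements not shown here. The function name `greedy_set_cover`, the dictionary layout, and the sample node names are assumptions made for this example.

```python
def greedy_set_cover(causal_graph, failed_observations):
    """Return SRGs that together explain every failed observation.

    causal_graph: dict mapping an SRG name to the set of observation
        names that the SRG can cause to fail (hypothetical layout).
    failed_observations: set of observation names reported as failed.
    """
    unexplained = set(failed_observations)
    explanation = []
    while unexplained:
        # Choose the SRG covering the most still-unexplained failures.
        # Iterating in sorted order makes name the tie-breaker.
        best = max(sorted(causal_graph),
                   key=lambda s: len(causal_graph[s] & unexplained))
        covered = causal_graph[best] & unexplained
        if not covered:
            break  # no remaining SRG explains the leftover failures
        explanation.append(best)
        unexplained -= covered
    return explanation


# Hypothetical three-SRG graph: F1 explains two of the three failures.
graph = {"F1": {"L1", "L2"}, "F2": {"L2"}, "F3": {"L3"}}
print(greedy_set_cover(graph, {"L1", "L2", "L3"}))  # ['F1', 'F3']
```

Because the routine only needs the SRG-to-observation edges, it runs unchanged on a single domain's graph, on a global graph, or on a graph that has been combined with a digest.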

Performing isolated inference with SHRINK and SCORE is straightforward.
Each domain performed inference, without benefit of collaboration, on its own
causal graph. The resulting best explanations, B1 and B2 from domains 1 and 2
respectively, were then combined into B1 ∪ B2. To implement full disclosure, a global
causal graph for the two domains was created using a full view of the topologies, and
then inference was performed on this global causal graph to derive B_U. Finally, for
the graph digest approach, the causal graph of Domain 2 was first processed with the
digest creation algorithm. The resulting digest was then combined with the causal
graph of Domain 1 using the techniques described in Section 2, and inference was
performed on the combined graph to produce B_D.

F. EVALUATION METRICS


1.  Evaluating  Accuracy 

To evaluate the accuracy A and cost C metrics (defined in Chapter III), the
inference accuracy of isolated inference and full disclosure relative to ground truth
was first computed for each failure scenario. These results
were compared to the accuracy achieved with the graph digest approach for the same
failure scenarios. The digest creation algorithm presented in Figure 4.1 was used to
create the graph digest for Domain 2. Equations 3.8 and 3.9 were applied to
determine the accuracy A and cost C for each failure scenario.

Hypothesis testing was used to determine whether the graph digest approach achieved
better accuracy than isolated inference for each topology and inference method used.
There are two domain relationships with three topology sizes each and two inference
algorithms, for a total of twelve data sets using the synthetic topologies with which
to test the hypothesis. Both SHRINK and SCORE performed inference on the Abilene-based
topology, providing two additional data sets for hypothesis testing. Because the
distributions were expected to be non-normal, the hypothesis was evaluated using the
Wilcoxon Signed-Rank Test at the 95% confidence level [12]. Specifically, the null
hypothesis H0 is that the graph digest approach achieves the same accuracy as isolated
inference, and the alternative hypothesis H1 is that the graph digest approach achieves
better accuracy than isolated inference.
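
As an illustration of the test just described, the sketch below computes a one-sided Wilcoxon signed-rank test on paired accuracy values. How the test was actually computed for the dissertation is not stated, so this is an assumption-laden sketch: it uses the large-sample normal approximation with no tie or continuity correction, it drops zero differences, and the function name and sample data are invented for the example.

```python
import math


def wilcoxon_signed_rank_greater(x, y):
    """One-sided test of H1: paired values in x tend to exceed those in y.

    Returns (w_plus, p_value). Uses the normal approximation, which is an
    assumption of this sketch; exact tables are preferable for small samples.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    # W+ is the sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std_dev = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std_dev
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail P(Z >= z)
    return w_plus, p_value


# Invented per-scenario accuracy values, for illustration only.
digest_accuracy = [90, 80, 95, 70, 85, 92, 100, 75]
isolated_accuracy = [50, 60, 55, 40, 45, 30, 50, 35]
w_plus, p_value = wilcoxon_signed_rank_greater(digest_accuracy, isolated_accuracy)
print(w_plus, p_value < 0.05)
```

H0 is rejected in favor of H1 when the returned p-value falls below 0.05, which corresponds to the 95% confidence level used in the evaluation.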

2.  Evaluating  Privacy  Protection 

The privacy protection provided by the prototype digest algorithm presented
in Figure 4.1 was evaluated. As discussed in Chapter III, the practical approach was
used to gauge the risk to privacy. Both of the evaluation inference algorithms, SHRINK
and SCORE, were designed to find faults at the physical layer based on observations
at the link and network layers. For this reason, sensitive properties that could be
measured from a physical topology reconstructed from a digest were selected.

In the evaluation, the following four sensitive properties were directly
measured:

1.  diameter  -  the  diameter  of  the  network. 

2.  number  of  routers  -  the  total  number  of  routers. 

3.  maximum  node  degree  -  the  degree  of  the  node  with  the  highest  degree. 

4. reachability - whether or not an internal path can be inferred between two
gateway routers.

For each of these sensitive properties, the ground truth values were the internal,
or “hidden” values. As an example, an 18-router network with 3 visible gateways has
15 hidden routers. A digest revealing 3 internal routers reveals 3 of the 15 hidden
routers. The diameter and degree sensitive properties are evaluated similarly, and
the reachability property is binary.
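
The four properties can be measured on any topology expressed as an adjacency mapping. The following is a minimal sketch, not the evaluation code used in this research. It assumes an undirected topology stored as a dictionary of neighbor sets, treats the diameter of a disconnected topology as the longest shortest path among connected pairs, and simplifies reachability to "a path exists in the extracted topology". All names are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(topology, source):
    """Hop counts from source to every router reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def evaluate_properties(topology, gateways):
    """topology maps each router to the set of routers adjacent to it."""
    dist = {router: bfs_distances(topology, router) for router in topology}
    return {
        # Longest shortest path among pairs that are connected at all.
        "diameter": max((max(d.values()) for d in dist.values()), default=0),
        "routers": len(topology),
        "max_degree": max((len(n) for n in topology.values()), default=0),
        # One boolean per gateway pair: is a path between them visible?
        "reachability": {
            (a, b): b in dist[a] for a, b in combinations(sorted(gateways), 2)
        },
    }


# Hypothetical extracted topology: G1-R1-R2-G2 plus an unconnected gateway G3.
extracted = {
    "G1": {"R1"}, "R1": {"G1", "R2"}, "R2": {"R1", "G2"},
    "G2": {"R2"}, "G3": set(),
}
print(evaluate_properties(extracted, ["G1", "G2", "G3"]))
```

Running the same function on the ground-truth topology and on the topology recovered from a digest gives the pair of values needed to state how much of each hidden property was revealed.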

Some may argue for treating IP link and web service failures themselves as
sensitive properties. Hiding evidence of failure, however, is both counter-productive
and directly opposed to the stated objective of sharing external observations of
failure. Consider the provider-customer domain relationship as discussed above. The
customer may need to give the provider specific information about failures so that
the provider, who is contractually obligated to provide and maintain connectivity,
can restore service. Similarly, peer-peer domains may have contractual obligations to
their respective customers.

No additional measures to hide sensitive properties were taken, but rather the
digest creation algorithm (Figure 4.1) was evaluated for the inherent privacy
protection provided. Removing sensitive information from a causal graph prior to digesting
may provide additional protection, but the accuracy trade-offs must be understood.
As shown in the final decision step in Figure 3.1, a digest is evaluated for compliance
with a security policy after digesting and before dissemination. Ultimately, provable
techniques are needed to evaluate the level of privacy disclosure for a digest. This
research developed and implemented attack heuristics to learn sensitive properties of
a network domain from its digested causal graph to gauge privacy protection.

Figure 4.19. Example causal graph.

Figure 4.20. Topology produced by the attack heuristic.

The  heuristic  was  implemented  to  specifically  attack  SHRINK-style  bipartite 
causal  graph  digests.  For  brevity  a  simplified  description  of  the  attack  heuristic  is 
presented. 

Consider the example causal graph in Figure 4.19. The causal graph has SRG
nodes S1 ... S6, observation nodes L1 ... L3, and a shared attribute observation node
A1. Suppose it is known that A1 represents a peering-point shared attribute. It can
now be concluded that S4 is a gateway router. Next observe that L3 has 3 parent SRG
nodes, with S1 having a cardinality of 1. SRG S1 is most likely a point-to-point link
connecting the gateway S4 and an adjacent router, S5. Now nodes S4 and S5, and
edge (S4,S5) are in a graph representing the topology. Applying the same reasoning
with L2 and L3 allows adding node S6 and edges (S5,S6) and (S4,S6), resulting in
the topology shown in Figure 4.20.

The attack heuristic, which is presented at a high level in Figure 4.21, proceeds
in three phases. First, the heuristic attempts to derive topology from a graph digest.
Second, the heuristic adds missing externally visible components, such as gateway
routers and transit links, to an extracted topology. Third, the heuristic uses the
topology to estimate the values of sensitive properties. The attack heuristic (Figure
4.21) takes two parameters: the digest D and the shared attributes SA. Each shared
attribute indicates whether or not it represents a peering point. The
heuristic was designed to attack both an undigested bipartite causal graph and a
digest created using the prototype digest creation algorithm (Figure 4.1). Eventually,
a collection of sample values is obtained for each property from each target set of
scenarios.

Digest-Attack-Heuristic(Digest D = (SRG S, Observation O),
                        Shared Attributes SA)
1  Extract-Topology(Digest D, Shared Attributes SA)
2  Add externally visible topology components
3  Evaluate-Properties(Topology T = (Routers R, Links L))

Figure 4.21. Heuristic used to attack graph digests.

A  conservative  approach  to  estimating  the  values  was  implemented,  which 
represents  an  estimated  lower  bound  on  the  information  learned.  In  other  words,  the 
attack  uncovers  values  for  sensitive  properties  that  are  almost  certainly  true.  As  an 
example,  suppose  an  attack  uncovers  two  adjacent  routers,  each  having  an  unresolved 
edge  representing  a  connection  to  another  router.  The  heuristic  assumes  that  these 
unresolved  edges  connect  to  the  same  router. 

Topology  extraction,  the  first  phase  of  the  attack  heuristic,  is  outlined  in  Figure 
4.22.  The  procedure  assumes  a  bipartite  digest  consisting  of  SRGs  S,  Observation 
nodes  O,  and  edges  E  represented  as  adjacency  lists.  The  procedure  uses  a  function 
adj() defined by

\[
\mathrm{adj} : O \cup S \to 2^{O} \cup 2^{S},
\]

where

\[
\mathrm{adj}(x) =
\begin{cases}
\{\, o \in O \mid (x, o) \in E \,\} & \text{if } x \in S \\
\{\, s \in S \mid (s, x) \in E \,\} & \text{if } x \in O
\end{cases}
\tag{4.1}
\]
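
A minimal sketch of adj() for a bipartite causal graph stored as a set of (SRG, observation) edges follows. The helper name `make_adj`, the data layout, and the sample node names are assumptions for illustration. The sketch also assumes SRG and observation names are disjoint.

```python
def make_adj(srgs, observations, edges):
    """Build adj() from Equation 4.1 for a bipartite causal graph.

    edges is a set of (srg, observation) pairs, i.e. the edge set E.
    """
    children = {s: set() for s in srgs}        # adj(x) for x in S
    parents = {o: set() for o in observations}  # adj(x) for x in O
    for s, o in edges:
        children[s].add(o)
        parents[o].add(s)

    def adj(x):
        # An SRG maps to its child observations; an observation maps to
        # its parent SRGs.
        return children[x] if x in children else parents[x]

    return adj


# One IP link between two routers: three SRG parents for one observation.
adj = make_adj(
    srgs={"R1", "R2", "P12"},
    observations={"IP12"},
    edges={("R1", "IP12"), ("R2", "IP12"), ("P12", "IP12")},
)
print(sorted(adj("IP12")))  # ['P12', 'R1', 'R2']
print(sorted(adj("P12")))   # ['IP12']
```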

A discussion of topology extraction from a customer’s causal graph in a
provider-customer domain relationship (Figure 4.22) follows. The extraction begins by
identifying candidate point-to-point links. The heuristic steps through all SRG nodes
looking for any that point to just one observation node having three or fewer SRG
parents. In a SHRINK-style model, a point-to-point link SRG will be a parent node of
an IP link observation node between adjacent routers. Before any digest
transformation and without any web services or VPN tunnels transiting the link, this
observation node has three parent nodes: the point-to-point link and two routers. In
this case, the point-to-point link SRG will have no other child observation nodes.
However, the link may have one or more transiting VPN tunnels or web services, each
represented by a separate observation node. The point-to-point link will be a parent
node of these observation nodes as well, each of which will likely have multiple (more
than three) SRG parent nodes.
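
The candidate search described above can be sketched as follows. This is one reading of the prose, not the procedure in Figure 4.22: an SRG is a candidate when exactly one of its child observations has three or fewer parents, and children with more parents (tunnels, web services) are ignored. The function name, the two-dictionary layout, and the sample graph are hypothetical.

```python
def candidate_links(children, parents, max_parents=3):
    """Return SRGs that look like point-to-point links.

    children: dict SRG -> set of child observation nodes.
    parents: dict observation -> set of parent SRG nodes.
    """
    candidates = set()
    for srg, observations in children.items():
        # Keep only the observations that look like plain IP links.
        small = [o for o in observations if len(parents[o]) <= max_parents]
        if len(small) == 1:
            candidates.add(srg)
    return candidates


# Path R1 - R2 - R3 with links P12 and P23, plus a VPN tunnel over both links.
parents = {
    "IP12": {"R1", "P12", "R2"},
    "IP23": {"R2", "P23", "R3"},
    "VPN1": {"R1", "P12", "R2", "P23", "R3"},
}
children = {}
for observation, srgs in parents.items():
    for srg in srgs:
        children.setdefault(srg, set()).add(observation)

print(sorted(candidate_links(children, parents)))  # ['P12', 'P23', 'R1', 'R3']
```

Note that the stub routers R1 and R3 are returned along with the two links, so a later step still has to separate routers from links.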

Now  that  some  of  the  candidate  point-to-point  links  have  been  identified,  all 
down  observation  nodes  are  evaluated,  starting  with  those  having  three  or  fewer 
parent  SRGs.  With  the  topologies  used  in  this  research,  these  comprise  the  majority 
of  the  observation  nodes.  Two  points  of  clarification  are: 

1.  Shared  attribute  nodes  are  never  added  as  routers. 

2.  A  tie  breaker  must  be  used  when  evaluating  equal  options. 

An observation node having one parent SRG is most likely a router and link
that have been aggregated, so that parent SRG node is added to the set of routers.
An observation node with two parents likely represents a stub router and a link, so the
parent node of highest degree is added to the set of routers. An observation node with
three parents, the most common case in an undigested graph, most likely represents two
routers connected by a point-to-point link. The two parent nodes with the highest
degree are added to the set of routers. If none of the observation node’s parents are
shared attributes, then a link between these routers is added to the topology.

If  a  down  observation  node  has  more  than  three  parents,  the  heuristic  assumes 
the  observation  node  represents  an  IP  tunnel  or  a  web  service.  Often,  any  routers  or 
links  that  can  be  learned  will  have  already  been  captured  by  other  observation  nodes, 
which  is  why  these  observation  nodes  are  processed  last.  The  heuristic  assesses  each 
of  these  observation  nodes  to  see  if  there  are  any  non-shared  attribute  SRG  parents 
not  previously  identified  as  routers  or  point-to-point  links.  If  any  such  parents  are 
found,  up  to  two  parents  are  added  to  the  set  of  routers  along  with  links  between 
them  if  neither  the  observation  node  nor  any  of  its  parent  SRGs  is  a  shared  attribute. 
Finally,  the  topology  extraction  heuristic  looks  at  routers  identified  at  the  peering 
points  and  adds  them  as  gateways. 

The heuristic EXTRACT-TOPOLOGY evaluates a SHRINK-style bipartite
causal graph representing network fault propagation, and extracts the network
topology from the causal graph. The algorithm produces a graph with nodes
representing routers and edges representing point-to-point links, and identifies IP
tunnels. On an undigested graph the algorithm always returns an isomorphism of the
original topology as shown below.

Without loss of generality, consider the customer topology extraction heuristic
in Figure 4.24, with the following assumptions about the network topology graph. The
graph is a simple, connected graph with at least two routers (footnote 1). Shared
attributes, gateway labels, and IP tunnels are not contained in the graph.

The following assumptions about the causal graph construction from the
network topology are made. First, the causal graph is correctly constructed from a
connected network topology graph. Second, the SRG nodes are routers and
point-to-point links, and the observation nodes are IP links between adjacent routers.
Each pair of adjacent routers has exactly one IP link observation node modeled between
them.

Footnote 1: The requirement to have more than one router in a network graph is
consistent with reverse-engineering a digest created using the digest creation
algorithm (Figure 4.1). An isolated router influences no IP links, hence the algorithm
would prune an SRG representing an isolated router from a causal graph.

Let a network topology as described above be the graph G = (V, E), where
V consists of a set of two or more connected routers and E consists of a set of
point-to-point links connecting the routers in V. Let the bipartite causal graph be
D = (S, O), where the SRG set S models all routers and point-to-point links in
G, and the observation set O models observations of all IP links between each pair
of adjacent routers in V. Let the network graph G' = (V', E') be the topology
constructed from reverse-engineering D. The heuristic EXTRACT-TOPOLOGY
(Figure 4.22) iterates through observation nodes o ∈ O to add edges e' ∈ E' and
previously undiscovered routers v' ∈ V' to G'. The function adj() is used as defined
in Equation 4.1. To prove that EXTRACT-TOPOLOGY correctly reverse-engineers
a graph G' isomorphic to G, this subsection will prove by induction that
each step in adding the routers and edges to G' is correct.

The  following  axiom  related  to  construction  of  the  dependency  graph  D  must 
be  true  for  the  Heuristic  in  Figure  4.22  to  return  an  isomorphism  G'  of  the  original 
topology  G  from  D. 

Axiom  1.  (Bipartite  Causal  Graph  Construction). 

An  IP  link  directly  connecting  two  routers  is  modeled  as  an  observation  node  and 
depends  on  exactly  three  SRG  parent  nodes  modeling  two  routers  and  the  point-to- 
point  link  between  the  routers.  The  SRG  modeling  the  point-to-point  link  affects  no 
other  observation  nodes.  An  SRG  modeling  a  router  may  affect  many  observation 
nodes,  but  a  stub  router  affects  only  one. 

Lemma 1. Provided that the above assumptions and Axiom 1 are true, an
observation node o ∈ O represents an IP link between two adjacent routers iff |adj(o)| = 3.

Proof. By Axiom 1, a point-to-point IP link between two adjacent routers is
modeled as an observation node dependent on 3 SRG parent nodes.

Suppose an observation node o ∈ O exists such that |adj(o)| ≠ 3. Then by
Axiom 1, o does not depend on precisely 3 SRG nodes, and hence o does not represent
an IP link. □

Lemma 2. Provided that the above assumptions and Axiom 1 are true, an SRG
node s ∈ S represents either a point-to-point link or a stub router iff |adj(s)| = 1.

Proof. By Axiom 1, an SRG modeling a point-to-point link or a stub router affects
exactly one observation node modeling an IP link.

Suppose an SRG s models a point-to-point link or stub router such that
|adj(s)| ≠ 1. There are two cases.

Case 1: |adj(s)| = 0. SRG s affects no IP links, hence s is an isolated
component, violating the assumption that G contains more than one router in a connected
simple graph.

Case 2: |adj(s)| > 1. SRG s affects more than one IP link, hence either s is
not a point-to-point link or stub router, or G is not a simple graph. □

Lemma 3. Provided that the above assumptions and Axiom 1 are true, an SRG
node s ∈ S represents a router that is not a stub router iff |adj(s)| ≥ 2.

Proof. By Axiom 1, an SRG modeling a non-stub router affects more than
one observation node modeling an IP link.

Suppose SRG s models a router that is not a stub router and |adj(s)| < 2.
There are two cases.

Case 1: |adj(s)| = 0. SRG s affects no IP links, hence s is an isolated
component, violating the assumption that G contains more than one router in a connected
simple graph.

Case 2: |adj(s)| = 1. SRG s affects exactly one IP link, hence by Lemma 2
s is either a point-to-point link or a stub router. □

Many lines in the heuristic (Figure 4.22) are not executed when processing an
undigested graph under the above assumptions and axiom. The following two lemmas
support removing those lines to increase clarity.

Lemma 4. Provided that the above assumptions and Axiom 1 are true, all
observation nodes o ∈ O have degree |adj(o)| = 3.

Proof. By the assumptions only IP links are modeled in the graph, and by Lemma 1,
IP links have degree |adj(o)| = 3.

Suppose an observation node o ∈ O does not have a degree of 3. Then by
Lemma 1, o does not model an IP link. By the above assumptions observation nodes
only model IP links, hence o is not in the model. □

Lemma 5. Provided that the above assumptions and Axiom 1 are true, the algorithm
correctly populates the temporary set of candidate point-to-point links C with
point-to-point links and stub routers.

Proof. Lines 3-5 populate C with precisely the set of point-to-point links and stub
routers by Lemma 2.

Suppose a point-to-point link or stub router SRG s ∈ S is not placed in set C.
Then there does not exist a single o ∈ adj(s) such that |adj(o)| = 3, and by Lemma
2, s is neither a point-to-point link nor a stub router. □

The heuristic EXTRACT-TOPOLOGY (Figure 4.22) can now be simplified.
The simplification is not necessary, but by removing lines that are not executed
on a causal graph constructed from the network topology graph described above, a
streamlined version of the algorithm can be used for the proof. First, by the above
assumptions G is an undigested graph, so references to Lup, which is inserted by the
digest creation algorithm, can be removed from lines 4, 6, and 26. The loop in lines
35-37 can similarly be removed. By Lemma 4, lines 7-11 can be removed since an
observation node will not have a degree of 1 or 2. By Lemma 5, all point-to-point links
are added to container C, so lines 17 and 18 can be removed. By the assumptions,
the graph has neither IP tunnels nor gateway labels, so lines 21-34 can be removed.
Similarly, the tunnel set T can be removed from line 1. Since shared attributes are
not considered, the set SA can be replaced by ∅. Line 3 can now be rewritten and
lines 13 and 14 removed since the conditional if (adj(o) ∩ ∅) ≠ ∅ always evaluates to
FALSE. The resulting simplified version of the algorithm is depicted in Figure 4.24.

Using the simplified version of the algorithm in Figure 4.24, the above axiom
and lemmas are now applied to prove by induction
on the number of IP links that EXTRACT-TOPOLOGY creates a G' that is
isomorphic to an arbitrary G using D.

Theorem 1. Provided that the above assumptions and Axiom 1 are true,
EXTRACT-TOPOLOGY correctly reverse-engineers an isomorphism G' of an
arbitrary graph G with k IP links, from a bipartite causal graph D that models
dependencies in G.

Proof. Base Case. Consider an arbitrary graph with k = 1 IP link, consisting of two
routers and a point-to-point link (Figure 4.25 (a)). By Axiom 1, the IP link will be
modeled as o ∈ O, dependent on two routers and a point-to-point link, each modeled
as an s ∈ S. The loop in lines 3-5 will add the point-to-point link and both routers
to set C. Since |adj(o)| = 3 by Lemma 1 and set P = ∅, lines 7 and 8 evaluate to
true. Since all three SRGs are in set C, set R = ∅, and |adj(s)| = 1 for each s ∈ S,
a tie-breaker adds one of the SRGs s ∈ S to P. Next, the other two SRGs s ∈ S
are added to the router set R in line 10. Finally, in line 11 an edge between the two
routers in R is added to L. The algorithm completes with V' = R and E' = L. The
resulting graph G' contains two routers with a point-to-point link between them.

Inductive Step. Assume that an arbitrary graph with k IP links is always
correctly reverse engineered, such that G' is isomorphic to G. It must be shown that
EXTRACT-TOPOLOGY creates a G' isomorphic to an arbitrary graph G with
k + 1 IP links.

Suppose a bipartite dependency graph D has been created from an arbitrary
graph G with k + 1 IP links using Axiom 1. Consider temporarily removing the
(k + 1)th observation node o ∈ O, all SRG parents s ∈ adj(o) such that |adj(s)| = 1,
and all edges into o. Assume that the inductive hypothesis is true, and that the first
k observation nodes have been correctly reverse engineered to create an isomorphic
graph G" ⊂ G. By Lemma 1, the removed (k + 1)th observation node depends on a
point-to-point link and two routers. By Lemma 2, an SRG s ∈ S models a point-to-point
link or stub router iff |adj(s)| = 1, hence these SRGs have been temporarily removed.
Clearly G" is missing a link and possibly a stub router that is in G. To prove that
EXTRACT-TOPOLOGY creates a G' isomorphic to G, the (k + 1)th observation
node, the removed point-to-point link SRG, and any stub router SRG must now be
added to D. The (k + 1)th observation node must now be reverse engineered and
added to G", creating G'. There are three cases as shown in Figure 4.25 (b) ... (d).
The missing components in G" after removing the (k + 1)th IP link o_{k+1} are illustrated
with dashed lines. In each case by Lemma 1 |adj(o)| = 3 and by Lemma 2 the SRG
s ∈ S modeling the point-to-point link in adj(o) has degree of 1. The algorithm
will add the SRG s ∈ S representing the point-to-point link, along with any stub
routers, to temporary set C in lines 3-5. Lines 7 and 8 always evaluate to true since
|adj(o)| = 3 by Lemma 1, and (adj(o) ∩ P) = ∅ since G is a simple graph. The three
cases follow.

Case 1: A point-to-point link and stub router are in G, but missing from G"
(Figure 4.25 (b)). Either an SRG modeling the missing stub router or the
point-to-point link in adj(o) is added to P in line 9. In line 10, the other two SRGs
are added to R. One router, s ∈ S, has been previously reverse engineered, but since
s ∈ R, R ∪ {s} = R. In line 11, a link is added between the two routers. Now that the
components in G missing from G" have been added, G' is isomorphic to G.

Case 2: A link in G is missing from G", and G" consists of two disconnected
components (Figure 4.25 (c)). Let o ∈ O be the (k + 1)th observation node. Since both
routers s ∈ adj(o) have been reverse-engineered by the first k observation nodes,
|adj(s)| > 1 for both SRG nodes representing the routers. Thus, |C| = 1, and C
contains just the point-to-point link. The link is added to P in line 9 and both
routers are re-added to R in line 10. Since s ∈ R for each router, R ∪ {s} = R. In
line 11, the link is added between the two routers. Now that the link in G missing
from G" has been added, G' is isomorphic to G.

Case 3: A link in G is missing from G", and G" consists of one connected
component (Figure 4.25 (d)). Let o ∈ O be the (k + 1)th observation node. Since both
routers s ∈ adj(o) have been reverse-engineered by the first k observation nodes,
|adj(s)| > 1 for both SRG nodes representing the routers. Thus, |C| = 1, and C
contains just the point-to-point link. The link is added to P in line 9 and both
routers are re-added to R in line 10. Since s ∈ R for each router, R ∪ {s} = R. In
line 11, the link is added between the two routers. Now that the link in G missing
from G" has been added, G' is isomorphic to G.

□ 

Any externally visible components that are missing from the extracted topology
are then added. The topology is then fed to the next heuristic, which is property
evaluation.

Topology extraction for the peer-peer domain relationship (Figure 4.23) proceeds
similarly to extraction for the provider-customer relationship with the following
three modifications. First, the phase to identify components based on point-to-point
links first processes all non-SA observation nodes, then all SA observation nodes.
Second, for the case of an SA observation node with two SRG parent nodes, the heuristic
adds only one router. Third, the heuristic processes observation nodes as well as SRG
nodes to identify gateway routers, and creates a peering link instead of a transit link.

The property evaluation phase, shown in Figure 4.26, measures the four sensitive
properties previously identified: reachability, domain diameter, number of
routers, and degree of the node with the highest degree. This heuristic takes the
topology generated by the heuristic in Figure 4.22 and a pair of gateway routers for
reachability testing as input parameters.

The  heuristic  first  iterates  through  the  routers  in  the  topology  to  find  the  node 
with  the  maximum  degree.  Next,  the  heuristic  temporarily  removes  all  peering  links 
from  the  topology.  If  the  reachability  nodes  have  been  identified  in  the  topology  as 
gateways  and  a  path  exists  between  them,  internal  reachability  is  determined  to  be 
true. 
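
The reachability step described above can be sketched as a breadth-first search over the topology with peering links left out. This is a minimal illustration, not the dissertation's implementation: the function name, the `(u, v, kind)` link encoding, and the `"peering"` label are assumptions made for the example.

```python
from collections import deque


def internal_reachability(routers, links, gateways, r1, r2):
    """Return True if r1 and r2 are gateways joined by a path of non-peering links.

    links is an iterable of (u, v, kind) tuples; the kind strings are an
    assumed encoding, with "peering" marking links to be ignored.
    """
    if r1 not in gateways or r2 not in gateways:
        return False
    # Build the adjacency structure with peering links temporarily removed.
    adjacency = {r: set() for r in routers}
    for u, v, kind in links:
        if kind == "peering":
            continue
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Breadth-first search from r1 looking for r2.
    seen = {r1}
    queue = deque([r1])
    while queue:
        node = queue.popleft()
        if node == r2:
            return True
        for nxt in adjacency[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


# Hypothetical example: two gateways connected through one interior router.
routers = {"R11", "R12", "R13"}
links = [("R11", "R13", "internal"), ("R13", "R12", "internal"),
         ("R11", "R12", "peering")]
print(internal_reachability(routers, links, {"R11", "R12"}, "R11", "R12"))
```

Because the search never follows a peering link, a pair of gateways that is connected only through another domain is reported as not internally reachable.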

To calculate the number of routers, the heuristic first needs to resolve unresolved
edges. A router has an unresolved edge in the topology if the router has an
edge to Lup in the causal graph. The heuristic attempts to resolve these unresolved
edges by checking to see if a link between routers with unresolved edges can be
established. Any number of routers may resolve their edges with a single router having
an unresolved edge. This detail is consistent with the logical OR implementation
in the digest algorithm (Figure 4.1), since an edge to Lup represents one or more
connections.

The heuristic adds all routers with an unresolved edge to set U, and colors
each router white. For each pair of routers in set U, if the routers are not adjacent
and both routers are not gateways, then the routers are colored black and an edge is
added between the routers. After adding these edges, a necessary step to
find the lower bound on the network diameter, the heuristic next checks to see if there
is a gateway router that is colored white. Since external information is visible with
respect to the domain attacking a digest and it is not common practice to peer with
multiple domains from a single gateway router, the heuristic explains the unresolved
edge by adding an internal router to the topology. If instead no gateway routers
are colored white but one or more internal routers are colored white, the heuristic
explains the first instance as an external connection to an unknown domain, and
a subsequent instance as an internal router. Dual homing is common practice for
network domains, but dual homing with an arbitrary number of connections is not.
If an internal router is added to a domain topology, every router in set U will be able
to resolve its unresolved edge with the added router.

Finally, the heuristic uses the Floyd-Warshall algorithm [10] to determine the
network diameter. The Floyd-Warshall algorithm computes the shortest path between
every pair of nodes in a graph; the diameter is the longest of these shortest paths.
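
A minimal sketch of the diameter computation follows, assuming an undirected topology in which every link has unit (hop-count) weight. The function name and the choice to ignore unreachable router pairs are assumptions for illustration; the dissertation does not state how a disconnected extracted topology is handled.

```python
import math


def network_diameter(routers, links):
    """Longest shortest path (in hops) over all reachable router pairs."""
    nodes = list(routers)
    # dist[u][v] starts at 0 on the diagonal, 1 for a direct link, else infinity.
    dist = {u: {v: (0 if u == v else math.inf) for v in nodes} for u in nodes}
    for u, v in links:
        dist[u][v] = dist[v][u] = 1
    # Floyd-Warshall relaxation: allow each node k in turn as an intermediate hop.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    # Assumption: pairs with no path are ignored rather than treated as infinite.
    finite = [d for row in dist.values() for d in row.values() if d != math.inf]
    return max(finite, default=0)


# Hypothetical four-router chain a - b - c - d.
print(network_diameter(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")]))
```

The triple loop is cubic in the number of routers, which is acceptable for the small topologies an attacker can reconstruct from a single digest.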

3.  Evaluating  Scalability 

The first ten double-failure scenarios from each topology were instrumented to
measure the real elapsed time using the Linux time command via Cygwin. To evaluate
the scalability metric E, the inference time was measured using both the full disclosure
and graph digest approaches. The mean of each set of ten measurements was used
as input to equation (3.11). The simulations were run on a 1.61 GHz computer with
960 MB RAM running Windows XP, Service Pack 3.

G.  CONCLUSION 

This chapter outlines the test methodology used to evaluate the hypothesis
that an approach using the framework in Chapter III achieves better accuracy than
isolated inference. Six different network domain topologies built from realistic
network topology atoms were used with varying sizes and two different domain peering
relationships. An additional topology based on the Abilene [1] backbone infrastructure
was used. The prototype graph digest algorithm (Figure 4.1) was evaluated using
two different intra-domain fault localization algorithms: SHRINK and SCORE. The
variety in topologies, as well as the use of two different inference algorithms, serves
to evaluate the generality of the framework described in Chapter III. By measuring
performance in terms of the accuracy, privacy, and scalability metrics outlined in
Chapter III, the evaluation results in Chapter V establish the feasibility of the graph
digest approach for cross-domain fault localization.
EXTRACT-TOPOLOGY(Digest D = (SRG S, Observation O), Shared Attribute SA)
    initialize router set R, link set L, and tunnel set T to ∅
    initialize temporary set variables C and P to ∅
    ▷ Identify candidate point-to-point links
    for each SRG s ∈ (S - SA)
        do if there exists a single o ∈ adj(s) such that o ≠ Lup and |adj(o)| ≤ 3
              then add s to C
    ▷ Identify components based on point-to-point links
    for each observation node o ≠ Lup in O
        do if |adj(o)| = 1
              then add s ∈ (adj(o) ∩ (S - SA)) to R
           elseif |adj(o)| = 2
              then add each s ∈ (adj(o) - (SA ∪ P)) to R
                   add edge (s_i, s_j≠i) for each s_i, s_j≠i ∈ (adj(o) ∩ R) to L
           elseif |adj(o)| = 3
              then if (adj(o) ∩ SA) ≠ ∅
                      then add each s ∈ (adj(o) - (SA ∪ P)) to R
                      else if (adj(o) ∩ P) = ∅
                              then add s ∈ ((adj(o) ∩ C) - R) with min. |adj(s)| to P
                           if (adj(o) ∩ P) = ∅
                              then add s ∈ (adj(o) - R) with min. |adj(s)| to P
                           add each s ∈ (adj(o) - P) to R
                   add edge between each s ∈ (adj(o) ∩ R) to L
    ▷ Identify gateway routers based on shared attribute peering points
    for each s ∈ (S ∩ SA) such that s is a peering point shared attribute
        do for each o ∈ (O ∩ adj(s))
              do if |adj(o) - SA| < 2
                    then label each s ∈ (adj(o) ∩ R) as gateway
                         add edge between each s ∈ (adj(o) ∩ R) and label as transit
    ▷ Identify routers based on logical tunnels
    for each observation node o ≠ Lup in O such that |adj(o)| > 3
        do initialize a new tunnel t to ∅
           if |adj(o) ∩ R| < 2
              then if |adj(o) ∩ R| = 0
                      then add s ∈ (adj(o) - (P ∪ SA)) with max. |adj(o)| to R
                   add s ∈ (adj(o) - (R ∪ P ∪ SA)) with max. |adj(o)| to R
           for each s ∈ adj(o) ∩ R
              do add router s to tunnel t
           T ← T ∪ {t}
    ▷ Identify unresolved edges
    for each s ∈ adj(Lup)
        do if s ∈ R
              then mark s with unresolved-edge ← TRUE

Figure 4.22. Heuristic to extract router, link, and tunnel sets of customer topology
from a graph digest.
EXTRACT-TOPOLOGY(Digest D = (SRG S, Observation O), Shared Attribute SA)
    initialize router set R, link set L, and tunnel set T to ∅
    initialize temporary set variables C and P to ∅
    ▷ Identify candidate point-to-point links
    for each SRG s ∈ (S - SA)
        do if there exists a single o ∈ adj(s) such that o ≠ Lup and |adj(o)| ≤ 3
              then add s to C
    ▷ Identify components based on point-to-point links
    ▷ Do for all non-SA observation nodes first, then for all SA observation nodes
    for each observation node o ≠ Lup in O
        do if |adj(o)| = 1
              then add s ∈ (adj(o) ∩ (S - SA)) to R
           elseif |adj(o)| = 2
              then if o ∈ (O ∩ SA) such that o is a peering point shared attribute
                      then add s ∈ (adj(o) - (SA ∪ P)) with max. |adj(o)| to R
                      else add each s ∈ (adj(o) - (SA ∪ P)) to R
                   add edge to L between each s ∈ (adj(o) ∩ R)
           elseif |adj(o)| = 3
              then if (adj(o) ∩ SA) ≠ ∅
                      then add each s ∈ (adj(o) - (SA ∪ P)) to R
                      else if (adj(o) ∩ P) = ∅
                              then add s ∈ ((adj(o) ∩ C) - R) with min. |adj(s)| to P
                           if (adj(o) ∩ P) = ∅
                              then add s ∈ (adj(o) - R) with min. |adj(s)| to P
                           add each s ∈ (adj(o) - P) to R
                   add edge between each s ∈ (adj(o) ∩ R) to L
    ▷ Identify gateway routers based on shared attribute peering points
    for each s ∈ (S ∩ SA) such that s is a peering point shared attribute
        do for each o ∈ (O ∩ adj(s))
              do if |adj(o) - SA| < 2
                    then label each s ∈ (adj(o) ∩ R) as gateway
                         add edge between each s ∈ (adj(o) ∩ R) and label as peering
    for each o ∈ (O ∩ SA) such that o is a peering point shared attribute
        do if |adj(o)| < 2 and |adj(o) ∩ R| = 1
              then label s ∈ (adj(o) ∩ R) as gateway
    ▷ Identify routers based on logical tunnels
    for each observation node o ≠ Lup in O such that |adj(o)| > 3
        do initialize a new tunnel t to ∅
           if |adj(o) ∩ R| < 2
              then if |adj(o) ∩ R| = 0
                      then add s ∈ (adj(o) - (P ∪ SA)) with max. |adj(o)| to R
                   add s ∈ (adj(o) - (R ∪ P ∪ SA)) with max. |adj(o)| to R
           for each s ∈ adj(o) ∩ R
              do add router s to tunnel t
           T ← T ∪ {t}
    ▷ Identify unresolved edges
    for each s ∈ adj(Lup)
        do if s ∈ R
              then mark s with unresolved-edge ← TRUE

Figure 4.23. Heuristic to extract router, link, and tunnel sets of a peering topology
from a graph digest.
EXTRACT-TOPOLOGY(Digest D = (SRG S, Observation O), Shared Attribute SA)
 1  initialize router set R and link set L to ∅
 2  initialize temporary set variables C and P to ∅
    ▷ Identify candidate point-to-point links
 3  for each SRG s ∈ S
 4      do if there exists a single o ∈ adj(s) such that |adj(o)| ≤ 3
 5            then add s to C
    ▷ Identify components based on point-to-point links
 6  for each observation node o ∈ O
 7      do if |adj(o)| = 3
 8            then if (adj(o) ∩ P) = ∅
 9                    then add s ∈ ((adj(o) ∩ C) - R) with min. |adj(s)| to P
10                 add each s ∈ (adj(o) - P) to R
11                 add edge between each s ∈ (adj(o) ∩ R) to L

Figure 4.24. Algorithm to extract router, link, and tunnel sets of customer topology
from an undigested graph digest.
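
The eleven numbered lines of Figure 4.24 can be sketched in Python as follows. This is an illustrative reading, not the author's code: the dictionary encoding of the causal graph, the function name, and the alphabetical tie-break in line 9 are assumptions, and the comparison in line 4 is read as "at most 3" because the proof requires SRGs adjacent to a three-parent observation node to enter C.

```python
from itertools import combinations


def extract_topology(obs_adj):
    """Sketch of Figure 4.24 for an undigested causal graph.

    obs_adj maps each observation node to the set of SRGs adjacent to it.
    Returns (R, L): router SRGs and links as frozensets of two routers.
    """
    # Derive adj(s) for every SRG from the observation-side adjacency.
    srg_adj = {}
    for o, parents in obs_adj.items():
        for s in parents:
            srg_adj.setdefault(s, set()).add(o)

    R, L, C, P = set(), set(), set(), set()  # lines 1-2

    # Lines 3-5: candidate point-to-point links (stub routers also qualify).
    for s, observations in srg_adj.items():
        small = [o for o in observations if len(obs_adj[o]) <= 3]
        if len(small) == 1:
            C.add(s)

    # Lines 6-11: each three-parent observation node yields one link, two routers.
    for o in sorted(obs_adj):
        parents = obs_adj[o]
        if len(parents) != 3:                      # line 7
            continue
        if not (parents & P):                      # line 8
            candidates = (parents & C) - R
            if candidates:                         # line 9, ties broken by name (assumed)
                P.add(min(candidates, key=lambda s: (len(srg_adj[s]), s)))
        R |= parents - P                           # line 10
        for a, b in combinations(sorted(parents & R), 2):
            L.add(frozenset((a, b)))               # line 11
    return R, L


# Hypothetical chain rA - rB - rC with link SRGs lAB and lBC.
graph = {"o1": {"rA", "rB", "lAB"}, "o2": {"rB", "rC", "lBC"}}
routers, links = extract_topology(graph)
print(sorted(routers), len(links))
```

In the chain example both a stub router and its link are candidates with degree 1, which is Case 1 of the proof: either choice for P yields a graph isomorphic to the original, so the tie-break only affects labels.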


Figure 4.25. (a) Base case and (b)...(d) possible cases for G"
EVALUATE-PROPERTIES(Topology = (Routers R, Links L))
    reachability ← FALSE
    diameter ← 0
    number-of-routers ← 0
    max-node-degree ← 0
    ▷ Determine reachability
    Temporarily remove all peering links from L
    if reachability nodes r1 ∈ R and r2 ∈ R are labeled as gateway
        then if a path r1 ⇝ r2 exists
                then reachability ← TRUE
    Replace all peering links
    ▷ Find the maximum node degree
    for each router r ∈ R
        do if degree of r > max-node-degree
              then max-node-degree ← degree of r
    ▷ Find the number of routers
    Add all routers in R that have an unresolved edge to temporary set U
    Color each router in U as white
    external-connection-added ← FALSE
    while exists a router in U colored white
        do for all r ∈ U
              do for all (s ≠ r) ∈ U such that s is not marked gateway
                    do if edge (r, s) ∉ L
                          then add (r, s) to L
                               color r black
                               color s black
           if exists r ∈ U colored white and marked gateway
              then add new router s to R
                   add s to U
                   color s white
              else if exists r ∈ U colored white and not marked gateway
                      then if external-connection-added = FALSE
                              then external-connection-added ← TRUE
                                   color r black
                              else add new router s to R
                                   add s to U
                                   color s white
    number-of-routers ← |R|
    ▷ Find the network diameter
    Run Floyd-Warshall algorithm on T to derive diameter

Figure 4.26. Heuristic to evaluate reachability, diameter, number of routers, and
maximum node degree sensitive properties from a graph digest.


V. EVALUATION RESULTS


This  chapter  presents  the  evaluation  results  for  the  model  presented  in  Chapter 
III.  The  evaluation  methodology  described  in  Chapter  IV  is  used. 


A.  PROVIDER-CUSTOMER  SETTING 

1.  Accuracy  Evaluation  Results  -  SHRINK 

For all but 5 of 377 tested scenarios, a_d ≥ a_s, resulting in non-negative accuracy
improvement scores A (Figure 5.1). The average A score was 0.31, 0.36, and
0.42 for the small, medium, and large topology respectively. The maximum score for
each topology was 1.0. There was an accuracy improvement in 55%, 66%, and 68% of
the test scenarios for the small, medium, and large topology respectively (indicated
by an oval in Figure 5.2). The results indicate that scaling the domain size has little
impact on the accuracy of B_d, B_1 ∪ B_2, or A with respect to B_T.

Figure 5.1. Histogram of A metric for the provider-customer setting.

Figure 5.2. CDF of A metric for the provider-customer setting.

All instances of accuracy A = 1.0 in Figure 5.1 reflect failure scenarios for
which only components in the provider (Domain 1) have failed. As discussed in
Chapter IV, failures in the provider domain are not observable by the provider. While
the customer can observe failures, inference by the customer in isolation will only
result in either incorrectly identified components in the customer domain, or
finger-pointing at the provider. While it is not surprising that isolated inference was wholly
unsuccessful in identifying the failures (a_s = 0.0) for these scenarios, the graph digest
approach did achieve perfect (a_d = 1.0) accuracy.

The five negative accuracy results stem from double-failure scenarios of components
in the same neighborhood of a router or switch. Four of the five failure
scenarios with a negative score were essentially the same failure scenario that
occurred four times. The failure scenarios in the small topology graph were {F1, R11},
{O1, R11}, and {F2, R11}. Considering that F1, O1, and F2 are indistinguishable in
the inference model, they were essentially the same failure scenario. Furthermore,
the failure scenario in the medium topology {F2, R11} was identical to the failure
scenario by the same name in the small topology. For each of these four scenarios,
B_d = B_u = {O2, R11}, resulting in a false positive (O2), and a false negative
({F1, O1, F2}). Isolated inference (∪_i B_i) hypothesized {R11}, resulting in a single
false negative result.

Figure 5.3. Histogram of C metric for the provider-customer setting.

The negative score in the large topology occurred for the failure scenario
{F3, P10-11}, where P10-11 is the point-to-point link between R10 and R11. For this
scenario B_d = {F3, R11}, with a false positive result (R11) and a false negative result
(P10-11). B_u = {F3} and ∪_i B_i = {P10-11}, each with a single false negative result.
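
The bookkeeping behind this example can be reproduced with plain set operations. The sketch below is illustrative only: the function name is hypothetical, and the component labels F3, R11, and P10-11 are a reading of labels that are partly garbled in the extracted text.

```python
def classify(hypothesis, truth):
    """Split a hypothesis into correct, false positive, and false negative sets."""
    hypothesis, truth = set(hypothesis), set(truth)
    return {
        "correct": hypothesis & truth,
        "false_positive": hypothesis - truth,   # blamed but not actually failed
        "false_negative": truth - hypothesis,   # failed but not blamed
    }


# Large-topology scenario as read from the text (labels are an assumption).
truth = {"F3", "P10-11"}
digest = classify({"F3", "R11"}, truth)        # graph digest, B_d
full = classify({"F3"}, truth)                 # full disclosure, B_u
isolated = classify({"P10-11"}, truth)         # union of isolated inferences

for name, result in [("digest", digest), ("full", full), ("isolated", isolated)]:
    print(name, sorted(result["false_positive"]), sorted(result["false_negative"]))
```

The digest hypothesis is the only one of the three that carries a false positive in addition to its false negative, which is consistent with its negative score in this scenario.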

The cost metric C depicted in Figures 5.3 and 5.4 shows a minimal cost in
using the digest approach. The cost to inference equaled zero in all but nine test cases,
meaning that the digest approach achieved the same accuracy as the full disclosure
approach in 97.6% of the 377 tested scenarios. The average score was 0.005, 0.002,
and 0.006 for the small, medium, and large topologies respectively, with a maximum
value of 0.17 in each topology. Each of the nine failure scenarios for which C > 0.0
occurred when two components failed simultaneously in the same neighborhood of
a router or switch. For each of these nine scenarios, the best explanation using the
graph digest approach returned one correct and one incorrect failure while the full
disclosure approach returned one correct failure and did not hypothesize a second
failure.

Figure 5.4. CDF of C metric for the provider-customer setting.

The  digest  algorithm  in  Figure  4.1  potentially  degrades  A.  The  logical-OR 
treatment  for  edges  to  Lup  removes  information  about  conditional  dependencies  on 
an  SRG  from  the  joint  distribution,  reducing  the  probability  the  SRG  is  up  given  the 
state  of  the  remaining  dependent  observation  nodes.  Additionally,  the  aggregation 
step,  exaggerated  by  using  uniform  prior  probabilities,  lumps  additional  SRGs  into 
a  best  explanation  for  a^.  Since  all  equipment  identified  in  a  hypothesis  would  have 
to  be  checked,  all  SRGs  that  have  been  aggregated  into  an  SRG  are  unraveled  into 
a best explanation. Consequently, aggregation potentially adversely affects h_d, and
ultimately a_d and A. In spite of the information loss, the graph digest approach
performed  remarkably  well  as  discussed  above. 

The Z values using the Wilcoxon Signed-Rank Test are 8.21, 8.92, and 5.44
for the small, medium, and large topologies respectively. The hypothesis H_1 passed
the 95% confidence test for SHRINK in the Provider-Customer setting.
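
For readers unfamiliar with the statistic, the following sketch computes a Z value for paired accuracy scores using the textbook normal approximation of the Wilcoxon signed-rank test. It is an assumption that the dissertation used this exact variant; the sketch drops zero differences, averages tied ranks, and applies no tie correction to the variance. The score lists are made-up data.

```python
import math


def wilcoxon_z(digest_scores, isolated_scores):
    """Normal-approximation Z for the Wilcoxon signed-rank test on paired scores."""
    # Zero differences carry no sign information and are dropped.
    diffs = [d - s for d, s in zip(digest_scores, isolated_scores) if d != s]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = average_rank
        i = j + 1
    # W+ is the rank sum of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / std


# Hypothetical paired accuracy scores (digest versus isolated inference).
z = wilcoxon_z([1.0, 0.5, 0.75, 1.0], [0.0, 0.25, 0.25, 0.5])
print(round(z, 4))
```

At the 95% level the standard normal critical values are 1.645 (one-sided) and 1.96 (two-sided), so Z values of the magnitude reported above exceed either threshold.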
Figure 5.5. Histogram of A metric for the provider-customer setting.

Figure 5.7. Histogram of C metric for the provider-customer setting.


2.  Accuracy  Evaluation  Results  -  SCORE 

For all but 5 of 377 tested scenarios, a_d ≥ a_s, resulting in non-negative accuracy
improvement scores A (Figure 5.5). The average score is 0.28, 0.26, and 0.31 for
the small, medium, and large topology respectively. The maximum score for each
topology is 1.0. An accuracy improvement in 57%, 45%, and 34% of the test scenarios
was observed for the small, medium, and large topology respectively (indicated by an oval
in Figure 5.6). The results indicate that scaling the domain size has little impact on
the accuracy of B_d, B_1 ∪ B_2, or A with respect to B_T.

The five instances for which accuracy scores are negative occurred in double-failure
scenarios. For each of these scenarios isolated inference returned only one
identified failure, resulting in one correct explanation and one false negative result.
The graph digest and full disclosure approaches returned one correct and one incorrect
explanation, yielding both a false positive and a false negative in the hypothesis. As
in SHRINK, each of the scenarios with a negative score occurred when two failures
occurred in the same neighborhood in the physical topology graph.

The cost metric C, depicted in Figures 5.7 and 5.8, is uniform at 0.0 across all
three topologies. This means that for all 377 tested scenarios, B_d = B_u. Interestingly,
when isolated inference outperformed the graph digest approach for the five failure
scenarios discussed above, isolated inference also outperformed full disclosure for these
same scenarios.

Figure 5.8. CDF of C metric for the provider-customer setting.

The  digest  algorithm  in  Figure  4.1  potentially  degrades  A  using  SCORE.  The 
impact  of  the  digest’s  logical-OR  step  to  SCORE  is  to  artificially  inflate  the  hit  ratio 
of  every  SRG  having  more  than  one  “up”  observation  set  member.  The  aggregation 
step  has  no  impact  on  SCORE  as  the  SCORE  algorithm  performs  this  step  during 
preprocessing.  As  with  SHRINK,  all  SRGs  that  have  been  aggregated  into  an  SRG 
are  unraveled  in  a  best  explanation.  In  spite  of  the  information  loss,  the  graph  digest 
approach  performed  remarkably  well  as  discussed  above. 

The Z values using the Wilcoxon Signed-Rank Test were 8.83, 8.93, and 5.78
for the small, medium, and large topologies respectively. The hypothesis passed the
95% confidence test using SCORE in the Provider-Customer setting.


             Small          Medium         Large
Degree       2.11 (4)       3.28 (5)       4.33 (6)
Diameter     3.27 (4)       9.95 (11)      22.01 (23)
Routers      12.06 (15)     47.71 (51)     197.97 (201)

Table 5.1. Privacy metric rMSE versus (true value).

             Node Degree    Domain Diameter    Number Routers
gSTD         1.09           0.59               1.66
E(X)         2.01           0.94               3.16

Table 5.2. Privacy metric gSTD versus sample mean.

3.  Privacy  Evaluation  Results 

The  digest  creation  algorithm  (Figure  4.1)  creates  identical  digests  for  both 
SHRINK  and  SCORE.  As  a  consequence,  the  privacy  results  for  SHRINK  and  SCORE 
are  identical. 

To  compute  the  privacy  protection  for  the  customer,  each  digest  was  attacked 
using  the  heuristic  described  in  Chapter  IV.  The  attack  heuristic  adds  any  missing 
externally  visible  gateway  routers  and  transit  IP  links  to  each  topology  extracted 
from  a  digest.  As  previously  discussed,  no  attempts  were  made  to  hide  information 
and  no  post-processing  of  the  digests  was  performed  to  reduce  the  information  leaked, 
but  rather  the  design  was  tested  to  see  how  much  information  leaked  using  the  simple 
digest  algorithm  described  in  Figure  4.1. 

As  depicted  in  Table  5.1,  the  root  mean  squared  error  (rMSE)  was  high  relative 
to  the  true  value  for  the  sensitive  properties.  The  outcome  for  privacy  evaluation 
means  that  the  information  learned  from  the  attacks  was  generally  far  from  the  true 
values.  Table  5.2  shows  that  the  generalized  standard  deviation  (gSTD)  for  each 
privacy  metric  was  low  compared  to  the  mean.  This  result  means  that  there  is  little 
variation  in  the  amount  of  information  learned  about  each  sensitive  property  from 
each  digest  attack.  These  results  suggest  a  reasonable  level  of  privacy  protection 
considering the use of a prototype digest creation algorithm for the attack heuristic.

To provide more detail, histograms of the attack estimates are shown. Additionally,
the relative error of the attack results versus the true value for each sensitive
property was calculated. The results, expressed with histograms and cumulative
distribution functions (CDF), are presented for each domain size in Figures 5.9 - 5.14
for every property except reachability. The reachability results are discussed next.
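
As a point of reference, the sketch below computes a root mean squared error and per-attack relative errors using their standard definitions. It is an assumption that the Chapter III definitions match these forms exactly, and the attack estimates in the example are made-up numbers rather than measured results.

```python
import math
import statistics


def rmse(estimates, true_value):
    """Root mean squared error of attack estimates against one true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


def relative_errors(estimates, true_value):
    """Relative error |estimate - true| / true for each attack estimate."""
    return [abs(e - true_value) / true_value for e in estimates]


# Hypothetical diameter estimates from three digest attacks; true diameter is 4.
estimates = [1, 1, 2]
errors = relative_errors(estimates, 4)
print(round(rmse(estimates, 4), 3))
print(errors, statistics.median(errors))
```

The median of the relative errors corresponds to reading a CDF of relative error at 50% of the experiments.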

The reachability metric is binary with a one, the true value, representing
internal reachability between two externally visible gateways in the customer domain.
The reachability test was conducted between gateway routers R11 and R12 in Figures
4.2, 4.5, and 4.6. The gateways are 2 hops apart, with router R13 connecting them.

Only 7 of the 377 evaluated failure scenarios identified the reachability between
the nodes. Six of the failure scenarios that revealed reachability involved failure of
router R13 with a second failure. The second failed component in each case caused a
failure observation that could have been caused by failure of R11 or R12. For example,
in the failure scenario {R13, F3}, the IP link between R11 and R12 is observed to be
down. Since the failure of either R11 or R12 would cause this link to observe failure,
this is a failure scenario that reveals reachability between the nodes. The seventh
failure scenario to reveal the reachability was the failure scenario {R11, R12}.

The  network  diameter  values  measured  by  the  attack  are  presented  in  Figure 
5.9.  The  histogram  clearly  shows  little  variation  in  the  attack  estimates.  At  50% 
mass  of  the  experiments  (Figure  5.10),  the  relative  error  was  75%,  91%,  and  96%  for 
the  small,  medium,  and  large  topologies  respectively.  This  result  bodes  well  for  the 
inherent  protection  provided  by  the  digest  approach  as  a  network  domain  size  scales 
up. 

A histogram showing the number of routers found by the attack heuristic
is presented in Figure 5.11. The histogram shows fairly consistent results for each
topology size. As shown in Figure 5.12, the relative error between the number of
routers in a network and the number detected from attacking a digest increased
with topology size. At 50% mass of the experiments (Figure 5.12), the relative error
was 80%, 94%, and 99% for the small, medium, and large topologies respectively.
Intuitively these results make sense since each digest only provides a small collection
of nodes. In general the topology learned consists of a neighborhood around one
or two routers, and multiple failures whose neighborhoods intersect allow a larger
portion of the topology to be inferred.

Figure 5.9. Histogram for the diameter property.

Figure 5.11. Histogram for the number of routers property.

When  a  failure  impacts  an  IP  tunnel,  as  do  78%  of  the  failure  scenarios, 
information  about  the  neighborhood  around  each  router  on  the  tunnel  is  potentially 
revealed.  The  IP  tunnels  do  have  an  inherent  protection  feature  in  that  an  observation 
node  representing  the  IP  tunnel  will  most  likely  have  more  than  three  parents  in  a 
digest.  This  creates  ambiguity  in  reconstructing  router  adjacencies  along  the  tunnel 
for  the  attack  heuristic  used. 

Figure 5.12. CDF relative error for the number of routers property.

The topologies were seeded with an unfavorable setting for the node degree
sensitive property by placing a router with high degree at the gateway in the customer
domain. The node aggregation and Noisy-OR steps performed by the digest algorithm
(Figure 4.1) did surprisingly well in hiding the true value of the node degree (Figures
5.13 and 5.14). The true degree was only revealed in 5% of the attacks, and the
property did not scale with the network domain size. At 50% mass of the experiments
(Figure 5.14), the relative error was 50%, 80%, and 83% for the small, medium, and
large topologies respectively. Better inherent protection would be expected if no high
degree nodes were placed near the gateways of the Domain 2 topology.

From the privacy results, the prototype digest algorithm provided significant
protection against attacks to learn the sensitive properties evaluated. Using the
attack heuristic to learn information from the digests yielded fairly uniform results
regardless of the domain topology's specific composition and size. This low deviation
in results is reflected in the low gSTD values in Table 5.2 and the plots in Figures 5.9 -
5.14. The rMSE growth of the estimates (Table 5.1) as the domain size increases
further demonstrates that the attack reveals no more information about a large domain
than it does about a small domain. The results further suggest that a privacy metric
whose true value naturally grows with the sheer size of a domain receives inherent
protection using a digest approach as the size of a network domain scales up. The
network diameter and the number of routers naturally grow with a network domain’s
size, while a high degree node or an interior path between two gateways remains fairly
static: an attack either finds it, or it does not.

Figure 5.13. Histogram for the node degree property.

        Small Topology    Medium Topology    Large Topology
E       1.00              2.81               4.84

Table 5.3. SHRINK scalability results.

4.  Scalability  Evaluation  Results  -  SHRINK 

To compute scalability E, the average elapsed real time to compute SHRINK
results for up to three failures was measured for the small, medium, and large
topologies.

As expected, the SHRINK running time increased significantly as the number
of SRGs increased. The increase in scalability E from using the graph digest approach
is evident in Table 5.3. Of particular note, inference time improved from hours to
milliseconds on the large topology.

Figure 5.14. CDF relative error for the node degree property.

       Small       Medium      Large
       Topology    Topology    Topology
E      0.14        0.34        1.09

Table 5.4. SCORE scalability results.

5.  Scalability  Evaluation  Results  -  SCORE 

To  compute  scalability  E  the  average  elapsed  real  time  to  compute  the  SCORE 
results  for  five  threshold  settings  was  measured  for  the  small,  medium,  and  large 
topologies  respectively. 

Although  the  SCORE  running  time  increased  less  dramatically  than  SHRINK 
as  the  number  of  SRGs  increased,  a  greater  growth  with  full  disclosure  is  evident  than 
with  the  graph  digest  approach.  The  increase  in  scalability  E  is  shown  in  Table  5.4. 
B. PEER-PEER DOMAINS


1.  Accuracy  Evaluation  Results  -  SHRINK 

The initial inference results were puzzling, as all of the accuracy scores were
low. SHRINK [19] tends to omit point-to-point links and stub routers from multiple
failure  scenarios  using  the  default  settings,  instead  attributing  the  evidence  of  failure 
about  these  components  to  an  error  in  the  SRG  database.  Although  merely  a  nuisance 
in  the  provider-customer  setting,  the  problem  became  magnified  in  the  peer-peer 
domain  setting  due  to  the  large  number  of  links  identified  as  cross-domain  SRGs  (e.g. 
the  peering  links  and  links  on  the  web  service  shortest  paths).  Additionally,  failures 
with  low  probability  mass  in  one  domain  caused  ambiguous  inference  results  for  Rj 
in  the  other  domain.  The  SHRINK  model  implemented  did  not  include  a  method  for 
the  inference  to  return  B,  =  0,  which  became  a  necessary  feature  in  the  peer-peer 
domain  setting. 

To counter the issue of SRG omission, the prior probabilities of the SRG nodes
were raised from $10^{-5}$ to $10^{-3}$. After the change, the inference engine preferred to add
an additional SRG first, and to assume an incorrect SRG database mapping second. To
correct the null hypothesis problem, a low probability “Not I” node was implemented,
which indicates no failures internal to a domain. Using the low probability node is
consistent with SHRINK.
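The effect of the prior change can be sketched with a simplified comparison of two competing explanations for the same evidence. This is an illustration only: the mapping-error probability of 1e-4 is an assumed value that does not come from the text, the function name is hypothetical, and the (1 - p) factors of a full posterior are ignored.

```python
def preferred_explanation(srg_prior, mapping_error_prob):
    """Compare two explanations of the same evidence by unnormalized weight.

    'two_srgs': both failures are real, so two SRG priors are multiplied.
    'one_srg_plus_error': one real failure, the other blamed on the SRG database.
    """
    two_srgs = srg_prior * srg_prior
    one_srg_plus_error = srg_prior * mapping_error_prob
    return "two_srgs" if two_srgs > one_srg_plus_error else "one_srg_plus_error"

MAPPING_ERROR_PROB = 1e-4  # assumed value for illustration only

before = preferred_explanation(1e-5, MAPPING_ERROR_PROB)
after = preferred_explanation(1e-3, MAPPING_ERROR_PROB)
```

Under these assumptions, the explanation with an additional failed SRG wins only once the SRG prior exceeds the mapping-error probability, which matches the behavior described above.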

The accuracy improvement metric A for the peer-peer topologies is depicted
in Figures 5.15 and 5.16. In all but 1 of 484 tested cases, $a_d \geq a_s$, resulting in
non-negative accuracy improvement scores A. In the tested scenarios, accuracy
improved in 31%, 30%, and 41% of the cases for the small, medium,
and large topologies respectively (highlighted with an oval in Figure 5.16). The A
score average was 0.09, 0.062, and 0.124 and the maximum value was 1.0, 0.33, and 0.5 for
the  small,  medium,  and  large  topology  respectively.  The  results  indicate  that  scaling 
the  domain  size  has  little  impact  on  the  accuracy  of  B Rj,  or  A  with  respect  to  B t- 
A slightly greater improvement in the large topology was seen, attributed to the rich
number of cross-domain web service connections.

Figure 5.15. Histogram of A metric for the peer-peer setting.

Figure 5.16. CDF of A metric for the peer-peer setting.

Figure 5.17. Histogram of C metric for the peer-peer setting.

The negative accuracy A score occurred in the double-failure scenario {R37, P26-37},
where P26-37 represents the point-to-point link between routers R26 and R37. Domain
1 returned the cross-domain link {P26-37} as the best explanation. Domain
2 returned gateway router {R37} and a shared attribute for the cross-domain link
in its causal graph. Using isolated inference, each domain correctly identified the
failed component in its domain. The graph digest and full disclosure approaches both
identified {R37} as the best explanation, preferring to treat evidence of failure about
P26-37 as an error in the causal graph mapping.

As  shown  in  Figures  5.17  and  5.18,  the  cost  metric  C  was  0.0  (no  cost)  for  all 
failure  scenarios.  The  results  mean  that  for  all  484  tested  failure  scenarios,  the  digest 
approach  achieved  the  same  inference  results  as  the  full  disclosure  approach. 

The  Z  values  using  the  Wilcoxon  Signed-Rank  Test  were  6.39,  6.19,  and  6.01 
for  the  small,  medium,  and  large  topologies  respectively.  These  results  each  provide 
95%  confidence  in  the  hypothesis. 
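For readers unfamiliar with the statistic, the sketch below computes a one-sided Wilcoxon signed-rank Z value using the usual normal approximation. It is a minimal illustration, not the procedure used to produce the values above: the paired scores are hypothetical, zero differences are dropped, and no tie or continuity correction is applied to the variance.

```python
from math import sqrt

def wilcoxon_z(first, second):
    """Normal-approximation Z for the one-sided Wilcoxon signed-rank test.

    Tests whether values in `first` tend to exceed the paired values in `second`.
    Zero differences are dropped and tied magnitudes share their average rank.
    """
    diffs = [a - b for a, b in zip(first, second) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[ordered[end + 1]]) == abs(diffs[ordered[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[ordered[k]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / std

# Hypothetical paired accuracy scores (digest approach versus isolated inference).
digest = [1.0, 0.5, 0.75, 1.0, 0.5, 1.0, 0.5, 0.75, 1.0, 1.0]
isolated = [0.0, 0.25, 0.5, 0.5, 0.25, 0.0, 0.5, 0.75, 0.5, 0.25]
z = wilcoxon_z(digest, isolated)
```

A one-sided test at 95% confidence compares Z against the standard normal critical value of about 1.645, so larger Z values give stronger support for the hypothesis.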
Figure 5.18. CDF of C metric for the peer-peer setting.


2.  Accuracy  Evaluation  Results  -  SCORE 

The accuracy improvement metric A for the peer-peer topologies is depicted
in Figures 5.19 and 5.20. For all but one of the 484 tested cases, $a_d \geq a_s$, resulting
in non-negative accuracy improvement scores A. The instance with a negative score
occurred in the same failure scenario, with the same details, as discussed above for
SHRINK in Section 1. In the tested scenarios, accuracy improved in 22%,
38%, and 47% of the cases in the small, medium, and large topology respectively
(highlighted with an oval in Figure 5.20). The A score average was 0.07, 0.08, and
0.15 and the maximum value was 1.0, 0.33, and 1.0 for the small, medium, and large
topology respectively. The results indicate a trend that accuracy A increases as the
domain size scales up. This result is attributed to the rich number of cross-domain
web service connections.

Figure 5.19. Histogram of A metric for the peer-peer setting.

Figure 5.20. CDF of A metric for the peer-peer setting.

As  shown  in  Figures  5.21  and  5.22,  the  cost  metric  C  was  0.0  (no  cost)  for  all 
failure  scenarios.  The  results  mean  that  for  all  484  tested  failure  scenarios,  the  digest 
approach  achieved  the  same  inference  results  as  the  full  disclosure  approach. 

The  Z  values  using  the  Wilcoxon  Signed-Rank  Test  were  5.44,  7.12,  and  6.51 
for  the  small,  medium,  and  large  topologies  respectively.  These  results  each  provide 
95% confidence in the hypothesis.

Figure 5.21. Histogram of C metric for the peer-peer setting.

Figure 5.22. CDF of C metric for the peer-peer setting.


            Small        Medium       Large
Degree      2.63 (4)     4.64 (6)     5.79 (7)
Diameter    3.04 (5)     8.16 (10)    22.48 (24)
Routers     6.61 (10)    22.94 (26)   119.24 (122)

Table 5.5. Privacy metric rMSE versus (true value).


        Node      Domain      Number
        Degree    Diameter    Routers
gSTD    1.04      0.62        1.66
E(X)    1.46      1.83        3.19

Table 5.6. Privacy metric gSTD versus sample mean.

3.  Privacy  Evaluation  Results 

The  privacy  results  using  SHRINK  and  SCORE  were  identical  for  the  reasons 
discussed  in  Section  3.  The  results  below  apply  to  digests  created  for  both  SHRINK 
and  SCORE. 

The root mean squared error results are shown in Table 5.5. Since the rMSE
values are high relative to the true values, the estimates of the sensitive properties
learned from the attacks are generally far from the true values. The
results from both the provider-customer and peer-peer settings are encouraging, and
a more robust digest creation algorithm should be able to improve on the results achieved
by the prototype algorithm.
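To make the comparison in Table 5.5 concrete, the sketch below computes a root mean squared error in the standard way, assuming that is how rMSE is defined in the evaluation methodology (the definition is not restated in this chapter). The estimates are hypothetical, chosen only to show how an rMSE can approach the magnitude of the true value when an attack consistently underestimates it.

```python
from math import sqrt

def rmse(estimates, true_value):
    """Root mean squared error of attack estimates against one true value."""
    return sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))

# Hypothetical attack estimates of node degree; the true value is assumed to be 4.
estimates = [1, 2, 1, 2]
error = rmse(estimates, 4)  # sqrt((9 + 4 + 9 + 4) / 4) = sqrt(6.5)
```

In this toy case the rMSE is more than half the true value, the same qualitative pattern the table reports for the degree property.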

As depicted in Table 5.6, the generalized standard deviation for each privacy
metric was low compared to the mean. As in the provider-customer setting, there
was little variation in the amount of information learned about each sensitive property
from each digest attack. An attacker using the attack heuristic will generally
estimate similar sensitive property measurements across a range of failure scenarios
and topologies.

Next, additional privacy protection data is provided by presenting histograms
of the raw estimates and the relative error of the attack results against the true values
for each sensitive property.


(a) Physical topology    (b) Perceived topology after aggregation of B and D

Figure 5.23. Aggregation effect on reachability.


Internal  reachability  between  the  visible  gateways  was  revealed  in  4  of  the  484 
evaluated  peer-peer  failure  scenarios.  The  tested  gateways  are  three,  three,  and  two 
hops  distant  in  small,  medium,  and  large  topologies  respectively.  Each  instance  of 
revealing  the  reachability  occurred  in  the  medium  topology  between  gateways  R9  and 
R12, which are 3 hops apart. The failed components were not in the neighborhood
of the evaluated gateway routers, as was the case in the provider-customer setting.

In each case of revealed reachability, aggregation by the digest algorithm (Figure
4.1) collapsed nodes and edges together that could individually reach one of the
gateways. These aggregated nodes effectively created a bridge to establish reachability.
To illustrate, gateway routers A and C connect to internal routers B and D
respectively as shown in Figure 5.23(a). Gateways A and C may or may not actually
be able to reach each other internally. If nodes B and D, representing the internal
routers, are indistinguishable to the digest creation algorithm (Figure 4.1), they are
aggregated into a single root cause node. A reverse-engineering attack on the digest
using the heuristic in Figure 4.23 then returns
the topology shown in Figure 5.23(b). Digests containing revealed reachability would
not pass the local security check shown in Figure 3.1 and would therefore not be
distributed.
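The bridging effect can be reproduced with a small graph sketch. The edge list below is an assumption that encodes only the two links named in the example (A to B and C to D), and the helper names are hypothetical. It is not the digest algorithm of Figure 4.1, only the node-merging step in isolation.

```python
from collections import deque

def reachable(edges, source, target):
    """Breadth-first search over an undirected edge list."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def aggregate(edges, merged_nodes, label):
    """Relabel every node in merged_nodes as a single aggregated node."""
    def rename(node):
        return label if node in merged_nodes else node
    return [(rename(u), rename(v)) for u, v in edges]

physical = [("A", "B"), ("C", "D")]  # assumed edges for Figure 5.23(a)
perceived = aggregate(physical, {"B", "D"}, "BD")
```

With only these edges, A cannot reach C in the physical list, yet the merged node makes them appear connected in the perceived topology, which is why such a digest would be caught by the local security check.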
Figure 5.24. Histogram for the diameter property.


As in the provider-customer setting, the attack heuristic infers a narrow range
of diameter estimates, regardless of the actual domain network diameter, as shown
in Figure 5.24. The relative error between the true network diameter and the attack
estimate grew with the topology size as shown in Figure 5.25. At 50% mass of the
experiments the relative difference was approximately 60%, 80%, and 96% for the
small, medium, and large topologies respectively.
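The "50% mass" figures can be read as the median of the per-experiment relative error. The sketch below assumes relative error is the absolute difference divided by the true value and that the 50% point of the empirical CDF is the median; both are interpretations rather than definitions quoted from the methodology, and the estimates are hypothetical.

```python
from statistics import median

def relative_errors(estimates, true_value):
    """Per-experiment relative error of each estimate against the true value."""
    return [abs(true_value - e) / true_value for e in estimates]

# Hypothetical diameter estimates against an assumed true diameter of 10.
errors = relative_errors([1, 2, 2, 3, 4], 10)
half_mass = median(errors)
```

Because the attack estimates stay in a narrow range while the true value grows with the topology, this median rises with domain size, which is the trend reported above.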

The  size  of  the  topology  had  little  bearing  on  the  estimated  number  of  routers 
as  seen  in  Figures  5.26  and  5.27. 

The  three  results  with  the  highest  inferred  values  were  9  routers  in  the  medium 
topology  and  two  results  that  found  10  routers  in  the  large  topology.  One  of  the  cases 
in  which  10  routers  were  revealed  in  the  large  topology  occurred  when  a  high  degree 
router failed in Domain 2. The digest creation algorithm (Figure 4.1) generated a
failed  IP  link  observation  node  for  each  link  connected  to  the  failed  router.  This 
failure  scenario  ({7298,  Pi_2j-)  also  revealed  the  true  value  of  the  high  degree  node  in 
Figure 5.28. The other two results with the highest inferred values occurred in double
failure scenarios when a link on the shortest path of many web service connections
failed. The difference between the true value and attack estimate for 50% mass of
the experiments (Figure 5.27) was 70%, 88%, and 98% for the small, medium, and
large topologies respectively.

Figure 5.25. CDF relative error for the diameter property.

Privacy  protection  for  maximum  node  degree  scaled  slightly  in  the  peer-peer 
domain  relationship  (Figures  5.28  and  5.29)  due  to  several  nodes  of  higher  degree  in 
the  internal  topology  of  domain  D2.  The  node  aggregation  and  Noisy-OR  steps  of  the 
digest  creation  algorithm  contributed  to  hiding  the  true  value  of  the  highest-degree 
node  for  most  of  the  attacks,  and  only  1%  of  the  attacks  revealed  the  true  high  node 
degree.  The  difference  between  the  true  value  and  attack  estimate  for  50%  mass  of 
the  experiments  (Figure  5.29)  was  75%,  83%,  and  86%  for  the  small,  medium,  and 
large  topologies  respectively. 

Again, the evaluated privacy metrics show inherent protection, returning
similar results regardless of the size of the domain topology. A stronger digest
algorithm and post-processing of a digest to remove any information over a predesignated
threshold will intuitively strengthen a digest against entropy loss to attack.
Several digests were created that failed to sufficiently hide network sensitive properties,
and these digests would not be distributed. The node aggregation step of the
digest algorithm (Figure 4.1) strips information from a graph digest, but with some
unintended consequences as discussed above. A more robust version of the algorithm
should address the identified shortcomings.

Figure 5.26. Histogram for the number of routers property.

       Small       Medium      Large
       Topology    Topology    Topology
E      0.71        1.29        1.72

Table 5.7. SHRINK scalability results.
Figure 5.27. CDF relative error for the number of routers property.

       Small       Medium      Large
       Topology    Topology    Topology
E      0.07        0.18        0.62

Table 5.8. SCORE scalability results.


4.  Scalability  Evaluation  Results  -  SHRINK 

The scalability (speed) improvement for the peer-peer domain scenario (Table
5.7), while significant, is not as dramatic as that observed in the provider-customer
setting. In the peer-peer scenario the domain performing inference has a larger
structure, resulting in a greater number of hypotheses for the inference engine to consider.
While not as pronounced as in the provider-customer setting, the running time savings
are still significant.

5.  Scalability  Evaluation  Results  -  SCORE 

As in the provider-customer setting, the scalability E improvement (Table 5.8)
is not as dramatic as that realized with SHRINK. Clearly though, using the graph
digest approach achieves much faster inference running time than the full disclosure
approach.

Figure 5.28. Histogram for the node degree property.


C.  ABILENE-BASED  SETTING 

1.  Accuracy  Evaluation  Results 

For all but 5 of 377 tested scenarios, $a_d \geq a_s$, resulting in non-negative
accuracy improvement scores A (Figure 5.30). The average score was 0.25 for both
SHRINK and SCORE, and the maximum score for each was 1.0. There was an
accuracy improvement in 34% and 68% of the test scenarios for SHRINK and SCORE
respectively (indicated by an oval in Figure 5.31).

All instances of accuracy A = 1.0 in Figure 5.30 reflect failure scenarios for
which at least one component in the provider domain (Domain 1) failed. In these
failure scenarios all failures were in the provider domain, or one failure occurred
in the provider domain and either a point-to-point link or a stub router failed in
the customer domain. In the latter cases, with the default SHRINK settings a best
explanation based on a single incorrect SRG dependency mapping may have a higher
posterior probability than a best explanation containing one more failed SRG.

Figure 5.29. CDF relative error for the node degree property.

The five negative accuracy results occurred using SCORE. The results stem
from double-failure scenarios containing a gateway in the customer domain and a
non-adjacent component in the provider domain. In each of these failure scenarios,
all three leased circuits failed, and SCORE identified the provider as the root cause of
all observed failures. These failure scenarios highlight a shortcoming in the greedy
heuristic used by SCORE. SHRINK, which returns the hypothesis with the maximum
posterior probability, correctly identified the failed gateways in these failure scenarios.
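The shortcoming can be illustrated with a plain greedy set-cover sketch. This is a deliberate simplification and not the SCORE algorithm itself, which also applies hit-ratio and coverage thresholds. The risk group names and circuit labels are hypothetical, chosen to mirror a case in which one group explains every observed failure.

```python
def greedy_cover(risk_groups, failed):
    """Repeatedly pick the risk group that explains the most unexplained failures."""
    unexplained = set(failed)
    chosen = []
    while unexplained:
        best = max(sorted(risk_groups), key=lambda g: len(risk_groups[g] & unexplained))
        if not risk_groups[best] & unexplained:
            break  # nothing left can explain the remaining failures
        chosen.append(best)
        unexplained -= risk_groups[best]
    return chosen

# Hypothetical risk groups mapped to the leased circuits they can take down.
risk_groups = {
    "provider": {"c1", "c2", "c3"},
    "customer_gateway": {"c1", "c2"},
    "provider_link": {"c3"},
}
hypothesis = greedy_cover(risk_groups, {"c1", "c2", "c3"})
```

Because the single provider group covers all three failed circuits, the greedy choice stops there and never considers the two-component explanation (a customer gateway plus a provider link), even if that pair is what actually failed.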

The cost metric C, depicted in Figure 5.33, shows no cost in using the graph
digest approach. The cost to inference accuracy equaled zero in all test cases, meaning
that  the  digest  approach  achieved  the  same  accuracy  as  the  full  disclosure  approach 
in  all  tested  scenarios. 

The Z values using the Wilcoxon Signed-Rank Test were 6.39 and 9.18 for
SHRINK and SCORE respectively. The hypothesis passes the 95% confidence test
for both inference algorithms.

Figure 5.30. Histogram of A metric for the Abilene-based topology.

Figure 5.31. CDF of A metric for the Abilene-based topology.

Figure 5.32. Summary of A metric results for the Abilene-based topology.

Figure 5.33. Histogram of C metric for the Abilene-based topology.

Degree      4.02 (6)
Diameter    6.56 (8)
Routers     67.36 (71)

Table 5.9. Privacy metric rMSE versus (true value).

        Node      Domain      Number
        Degree    Diameter    Routers
gSTD    1.41      1.00        2.23
E(X)    2.24      1.52        3.68

Table 5.10. Privacy metric gSTD versus sample mean.

2.  Privacy  Evaluation  Results 

The rMSE values of the attack estimates were nearly as large as the true hidden
values, as depicted in Table 5.9, so the estimates were far from the true values. The
gSTD values were low compared to the mean attack
values as shown in Table 5.10. These results indicate that the digest attacks were
unsuccessful at revealing the hidden values for the sensitive properties.

Histograms and CDFs of the attack estimates for each sensitive property, less
reachability, are presented in Figures 5.34 - 5.39. The reachability test was conducted
between gateway routers R4 and R7 in Figure 4.17. The gateways are 3 hops apart via
internal links, and none of the 159 digests revealed the hidden reachability between
these two gateway routers.

The network diameter values measured by the attack are presented in Figure
5.34. At 50% mass of the experiments (Figure 5.35), the relative error is
approximately 88%.

A histogram showing the number of routers found by the attack heuristic is
presented in Figure 5.36. At 50% mass of the experiments (Figure 5.37), the relative
error is approximately 97%. Intuitively this result makes sense since each digest
only provides a small collection of nodes. In general the topology learned consists of
a neighborhood around one or two routers, and multiple failures whose neighborhoods
intersect allow a larger portion of the topology to be inferred.

Figure 5.34. Histogram for the diameter property.

Figure 5.35. CDF relative error for the diameter property.

Figure 5.36. Histogram for the number of routers property.

     SHRINK    SCORE
E    2.47      0.24

Table 5.11. SHRINK and SCORE scalability results.

A  histogram  depicting  the  maximum  node  degree  attack  results  is  shown  in 
Figure  5.38.  At  50%  mass  of  the  experiments  (Figure  5.39),  the  relative  error  was 
approximately  67%. 

3.  Scalability  Evaluation  Results 

To compute scalability E, the average elapsed real time to compute SHRINK
and SCORE results for up to three failures was measured on the Abilene-based
topology. The mean elapsed real time of the first ten double failure scenarios for both full
disclosure inference and for the graph digest approach was used to compute E.

As expected, the graph digest approach enabled significant time savings while
performing inference with SHRINK. Although not as dramatic, the graph digest
approach enabled faster inference using the SCORE algorithm as well. These results
are consistent with the scalability results on the synthetic topologies constructed with
network topology atoms.

Figure 5.37. CDF relative error for the number of routers property.

D.  CONCLUSION 

This chapter provides the experimental results using the evaluation methodology
outlined in Chapter IV. For each of the seven tested topologies and for both
implemented  intra-domain  fault  localization  algorithms,  the  hypothesis  of  accuracy 
improvement  was  supported  with  a  95%  confidence  level.  Although  the  hypothesis 
testing  was  conducted  for  95%  confidence,  all  fourteen  hypothesis  tests  would  have 
passed  at  99.95%  confidence. 

A summary of the accuracy results for the synthetic topologies is presented in
Figure 5.40. In total, there were 814 failure scenarios in which using the graph digest
approach improved accuracy in finding cross-domain faults.

In general, both SHRINK and SCORE performed well at finding faults. When
a highly connected device failed, all approaches typically correctly identified the failed
component using SHRINK or SCORE. Using the full disclosure approach, SCORE
with a mean accuracy of 0.97 outperformed SHRINK with a mean accuracy of 0.88.
Using the graph digest approach, SHRINK with a mean A score of 0.21 outperformed
SCORE with a mean A score of 0.19. Using the default SHRINK settings, SHRINK
assigns a lower prior probability of failure to a device than the probability that a
conditional dependency mapping is incorrect. SCORE tended to outperform SHRINK
when a stub router or point-to-point link failed. SHRINK, with its built-in robustness
to errors in causal graphs, performed better at inference on transformed
data that included a graph digest.

Figure 5.38. Histogram for the node degree property.

The privacy results show measurable protection of the sensitive properties
evaluated. The estimate means were generally far from the true values with little
deviation in the estimates. Clearly some digests, such as those revealing reachability,
would fail local security policy checks. These digests would not be sent as discussed in
Chapter III, Section B. The privacy results for the synthetic topologies and empirical
topology are similar, strengthening an argument for applicability of the graph digest
approach to a broad range of networks. A summary of privacy rMSE and gSTD
results for the provider-customer topologies is presented in Tables 5.12 and 5.13.

Figure 5.39. CDF relative error for the node degree property.

            Small         Medium        Large           Abilene-Based
Degree      2.11 (4)      3.28 (5)      4.33 (6)        4.02 (6)
Diameter    3.27 (4)      9.95 (11)     22.01 (23)      6.56 (8)
Routers     12.06 (15)    47.71 (51)    197.97 (201)    67.36 (71)

Table 5.12. Summary of provider-customer rMSE versus (true value).

One of the surprises from the experiments is that the proposed metric gSTD is
much more effective than expected at gauging the performance of the digest-creation
algorithm in hiding the values of sensitive properties. While further investigations
are required to validate the generality of such effectiveness, the results give weight to
further investigation into the approach.

        Node      Domain      Number
        Degree    Diameter    Routers
gSTD    1.20      0.74        1.85
E(X)    2.08      1.11        3.31

Table 5.13. Summary of provider-customer gSTD versus sample mean.

[Pie chart showing accuracy metric A summary]

Figure 5.40. Summary of A metric for the synthetic topologies.

A decrease in running time was demonstrated by using the graph digest
approach versus full collaboration, achieving positive scalability results for the approach.

The experimental results reaffirm the observation that more research efforts
in this space are needed. In all topologies simulated, a number of scenarios where
domains cannot troubleshoot effectively in isolation were discovered. A large portion
of the real world scenarios are expected to be more complicated than those evaluated
in this research. Therefore, the need for cross-domain solutions is real.
VI. CONCLUSIONS


This  chapter  first  provides  the  main  conclusions  from  the  research,  and  then 
discusses  future  work  identified  in  the  course  of  the  research. 

A.  RESEARCH  CONCLUSIONS 

This research demonstrated that by correlating risks and observations from
different domains, cross-domain fault localization has the potential to significantly
increase the accuracy of network fault localization. It also articulated the main
challenges to realizing inference accuracy gain, particularly the privacy consideration. The
main contributions are a framework with explicit metrics to evaluate a cooperative
design in the design space, and an inference-graph-digest based formulation of the
problem. The graph digest approach also facilitates the re-use of existing fault
localization algorithms without compromising each domain’s information hiding policy.

The evaluation supported the hypothesis with 95% confidence for all 14 evaluated
data sets: It is possible to construct a framework to enable managers of separate
network domains to share information and achieve inference gain while quantifying
privacy preservation of sensitive information.

This research presents the first comprehensive evaluation of the feasibility
of cross-domain fault localization. The evaluation is systematic and complete with
regard to all the proposed performance metrics.

The  goal  was  to  answer  the  following  overarching  questions: 

1. Does cross-domain fault localization offer the kinds of benefits warranting
further research?

2.  Can  it  provide  deployable  and  acceptable  privacy  protection  with  manageable 
complexity? 

The answer to both of these questions was a strong “Yes”. Cross-domain fault
localization, using the prototype design, performed quite well at finding the faults
in all failure scenarios. Of course, in practice not using a design that balances the
requirements of accuracy and privacy is a non-starter: domain administrators will
simply be unwilling to reveal their complete topologies. This leads to the second
question. The digest approach did provide significant performance gains compared
with localization performed in isolation while measurably protecting the sensitive
properties tested. The use of a design that enables sharing summary information
dramatically increases the deployability of cross-domain fault localization by
decreasing inference time by two to three orders of magnitude.

While providing a positive answer to both high-level questions, the evaluation
also reveals several opportunities for further research and enhancement, including
richer causal graph models and better digest algorithms. This underscores the
importance of having a repeatable evaluation methodology.

B.  FUTURE  WORK 

To  move  forward  certainly  requires  a  fundamental  understanding  of  the  issues 
beyond  the  framework,  approach,  and  scenarios  described  in  this  dissertation.  Is 
the  graph  digest  approach  applicable  to  a  wide  range  of  network  scenarios?  What 
about  scenarios  involving  more  than  two  domains?  Does  there  exist  a  general,  yet 
easily  calculable  metric  for  quantifying  the  highly  domain-specific  information  hiding 
policy?  How  should  observation  errors  and  graph  model  inaccuracies  be  detected  and 
controlled?  These  and  other  similar  questions  constitute  a  new  area  of  networking 
research  which  may  have  a  major  impact  on  network  fault  trouble-shooting  practices. 
In  light  of  this  work,  a  prioritized  list  of  future  work  follows. 

• Beyond visual checks of “Does this seem reasonable?”, this research did not validate
how well the scenarios capture issues faced in real world practice. There
is little or no publicly available data to allow this validation. Further testing on
real network topologies must be done as a logical next step to validate the utility
of the framework and graph digest approach. Obtaining troubleshooting
records and topology from one’s own domain is challenging enough. Collecting
such sensitive data from multiple domains is almost impossible. There may
be hope, as data collection with fault logging by collaborative projects such
as Abilene and GEANT would provide a much needed evaluation dataset for
cross-domain fault localization, and fault localization in general.

•  Distributions  accurately  representing  an  adversary’s  beliefs  about  sensitive 
properties  that  model  general  and  specific  events  would  allow  application  of 
the  KL  distance  to  evaluate  privacy  protection.  While  this  may  remain  a 
lofty  goal  for  many  sensitive  properties,  any  such  distributions  that  can  be 
established  will  enable  use  of  the  ideal  metric  for  privacy  protection. 

•  Provable  practical  privacy  risk  metrics  are  needed.  The  presented  rMSE 
and  gSTD  practical  privacy  metrics  provide  a  sound  methodology,  but  their 
strengths  and  limitations  still  need  to  be  proven.  Domain  managers  will  be 
reluctant  to  use  any  cooperative  design  without  proven  bounds  on  the  risk  to 
sensitive  properties. 

• An ontology of sensitive properties with privacy protection implementation
methods is needed. While sensitive properties may vary between domains,
there may be a core set that most domain managers would agree comprise the
majority of these properties. Identifying the common sensitive properties is a
logical first step toward developing robust protection against disclosure.

• Some inherent bias is acknowledged in attacking the digests using the
attack heuristics developed in this research. The attack heuristics do, however,
reveal the entire network structure of the undigested causal graphs,
indicating a sound baseline attack method. While a more thorough attack
strategy is needed, it may be possible to determine the effectiveness and
limitations of the presented heuristic. A variant of the heuristic may be a valuable
component of such a future attack strategy. This is just one of many reasons
why rigorous analysis and proofs of privacy are needed before the graph digest
approach can be widely adopted when privacy is a concern.

• The current algorithm for constructing digests incorporates network domain
knowledge. Techniques from the artificial intelligence and statistics communities
for approximating statistical distributions could be leveraged to produce
smaller and more accurate digests. Since performing fault localization with
digests is significantly faster than without, perhaps digests can be used internally
in very large domains to yield faster inference.

For example, instead of allowing a digestion algorithm to produce
variable-sized digest causal graphs, the size and/or structure of the graph digest
may be constrained a priori, similar to the way in which a secure hash function
has a predefined fixed width in bits (e.g., 512 bits for SHA-512) for all hash
values. The primary advantage of this approach is that the gSTD would be
small regardless of the scenarios used. However, this approach also poses
a challenge. By restricting the size and structure of a graph digest, it might
be difficult to encode within it sufficient information to support inference for
large scenarios.

• Observation errors and missing observations were not modeled in this research
when evaluating inference accuracy. Such events are common in the
real world due to software bugs or misconfigurations. Follow-on work should
evaluate the performance of the graph digest approach in a noisy observation
environment. These errors are expected to similarly impact all discussed
approaches and, therefore, to introduce very small perturbations to the A and C
metrics.

• The core network causal graph model (SHRINK [19]) has a very simple
structure. The structure has the advantage of easy inference but lacks
expressiveness. In particular, the bipartite nature makes compositing levels difficult. The
Sherlock [4] work gives a more expressive model without sacrificing SHRINK’s
inference speed advantages. Expanding the expressive power of the causal
graph model requires new algorithms for specifying shared attributes,
combining graphs, and constructing digests.

•  Finally,  in  addition  to  the  development  of  better  metrics  and  algorithms,  an 
emphasis  should  be  placed  on  the  creation  of  new  theories  for  reasoning  about 
what  can  and  cannot  be  achieved  in  balancing  the  trade-off  between  inference 
accuracy  and  privacy  protection.  Appropriate  mechanisms,  trust  models,  and 
policy  must  also  be  developed  to  support  the  exchange  of  causal  graph  digests 
and  other  relevant  information  (e.g.,  shared  attributes)  between  domains  in 
collaboration. 
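
As an illustration of the KL distance mentioned in the second item above, the following is a minimal sketch of how the divergence between two discrete belief distributions might be computed. It is not the formulation used in this dissertation. It assumes that an adversary's beliefs about one sensitive property can be written as a probability vector over a small set of candidate values, and that privacy loss is judged by comparing the belief held after seeing a digest with the belief held before. The function name kl_divergence and the example values in prior and posterior are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Return D_KL(p || q) in bits for two discrete distributions.

    p and q are equal-length sequences of probabilities over the same
    candidate values. Terms with p[i] == 0 contribute nothing; if
    q[i] == 0 while p[i] > 0 the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


# Hypothetical adversary beliefs over four candidate values of a
# sensitive property (illustrative numbers, not measured data).
prior = [0.25, 0.25, 0.25, 0.25]      # belief before seeing a digest
posterior = [0.70, 0.10, 0.10, 0.10]  # belief after seeing a digest

# Under the stated assumption, a larger value means the digest moved
# the adversary's belief further from the prior.
shift = kl_divergence(posterior, prior)
print(f"belief shift: {shift:.3f} bits")
```

A value of zero would mean the digest left the adversary's belief unchanged. The practical difficulty noted in the list above remains: the sketch is only as meaningful as the distributions supplied to it.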